Executive Summary

In this project, our team of data modeling students has meticulously developed a sophisticated model for predicting real estate prices in King County, USA, aiming to inform and guide high-level executives and investment leaders in their strategic decision-making. This model, built from a comprehensive dataset encompassing various property attributes, employs advanced techniques such as linear regression, regression trees, and neural networks, ensuring accuracy and versatility.

The primary objective of the model is to serve as a pivotal tool for investment strategy, market analysis, and policy formulation. It is designed to align with regulatory standards and internal compliance requirements, ensuring ethical and responsible use of data. However, it’s important to note that the model’s applicability is specific to King County’s real estate market, and its predictive accuracy is contingent upon the ongoing integration of current market data.

While our model represents a significant advancement in data-driven real estate analysis, it is essential to recognize its limitations. These include potential biases in historical data and the model’s limited generalizability beyond the regional context of King County. Regular updates and adaptations of the model are recommended to maintain its relevance and accuracy in a dynamic market environment.

Overall, this model stands as a testament to the power of data science in enhancing market understanding and investment strategies, offering valuable insights for executive decision-making and organizational growth in the real estate sector.

I. Introduction

In the ever-evolving landscape of real estate, the ability to accurately predict house prices is invaluable. This project, undertaken by a dedicated team from Harvard Extension School, delves into this realm, focusing on King County, USA. The motivation behind our work is twofold: firstly, to provide a robust predictive tool for potential investors and market analysts, and secondly, to contribute to the academic understanding of real estate market dynamics.

Our approach is grounded in the analysis of the ‘KC_House_Sales’ dataset, a comprehensive collection of house attributes within King County. The dataset, rich in detail, includes features such as square footage, the number of bedrooms, and geographical location, all of which are pivotal in determining house prices. The initial phase of our project involved a thorough data cleaning process. This step was crucial in ensuring the integrity of our analysis, involving the removal of irrelevant variables, handling missing values, and converting data types for consistency.

Once the data was prepared, we embarked on a methodological journey, exploring various statistical and machine learning techniques. Our primary method, linear regression, served as a foundation model, offering insights into the direct relationships between house features and their prices. However, recognizing the complexity of the real estate market, we expanded our toolkit to include more advanced models like neural networks, decision trees, and support vector machines (SVM). Each of these methods brought a unique perspective and depth to our analysis, enabling us to capture non-linear relationships and complex interactions between variables.

The structure of our project involved splitting the dataset into two parts: 70% for training and 30% for validation. This split was strategically chosen to ensure the robustness of our models against unseen data, a critical aspect of predictive modeling. The training phase was an iterative process, where each model was refined through a series of evaluations and adjustments. In doing so, we aimed to strike a balance between model complexity and predictive accuracy.

Our exploratory data analysis revealed several key insights. We observed that certain features, such as square footage and location, had a significant impact on house prices. This initial observation guided our feature selection and engineering process, where we developed new variables that could potentially enhance the model’s performance.

As we progressed, the need for a rigorous evaluation framework became apparent. To this end, we employed metrics such as MSE, R-squared, and pseudo-R-squared values. These metrics were instrumental in assessing the performance of our models on both the training and validation sets. The final selection of our primary model, which we refer to as the “champion” model, was based on a comprehensive evaluation of these metrics.
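To make these metrics concrete, here is a minimal sketch of how they can be computed from out-of-sample predictions. The function and object names (`evaluate_model`, `fit`, `newdata`) are illustrative placeholders, not the exact code used later in this report.

```r
# Sketch: computing out-of-sample evaluation metrics for a fitted model.
# `fit` is any model supported by predict(); `actual` holds observed prices.
evaluate_model <- function(fit, newdata, actual) {
  pred <- predict(fit, newdata = newdata)
  mse <- mean((actual - pred)^2)          # Mean Squared Error
  sse <- sum((actual - pred)^2)           # residual sum of squares
  sst <- sum((actual - mean(actual))^2)   # total sum of squares
  r2  <- 1 - sse / sst                    # R-squared (a pseudo-R-squared
                                          # when computed on held-out data)
  c(MSE = mse, R2 = r2)
}
```

Computing R-squared this way on the validation set, rather than reading it from the training summary, is what guards against rewarding an overfit model.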

In conclusion, this project represents a significant endeavor in the field of predictive analytics for real estate. Through a blend of traditional statistical methods and advanced machine learning techniques, we have developed a model that not only predicts house prices in King County with a high degree of accuracy but also offers insights into the factors that drive these prices. As we move forward, our focus will be on refining the model, exploring new data sources, and adapting to the changing dynamics of the real estate market.

II. Description of the data and quality

In this section, we meticulously dissect the King County house sales dataset, a crucial foundation for our predictive modeling. This dataset is a rich amalgamation of diverse variables, ranging from basic house attributes like square footage and number of bedrooms, to more nuanced features such as the year of renovation and the presence of a waterfront. The quality of this dataset is paramount, as it directly influences the reliability and precision of our predictive models. We delve into a detailed assessment of the data quality, addressing aspects like completeness, consistency, and potential biases, ensuring that our analysis is built on a robust and accurate foundation.

Data Overview

The dataset presented for analysis encompasses a range of housing attributes for properties sold in King County, including the sale price, number of bedrooms and bathrooms, square footage of living and lot space, and other characteristics like the presence of a waterfront, views, and the condition and grade of the house. Each entry is timestamped, providing a date of sale, which can be instrumental in understanding market trends over time.

# Overview of the King County House Sales Dataset
df_house = read.csv("KC_House_Sales.csv")
cat("Number of NA values in dataframe:", sum(is.na(df_house)))
## Number of NA values in dataframe: 0
head(df_house)

Data Types, Categories and Cleaning

Our dataset includes a blend of continuous and categorical data types. Notably, the ‘zipcode’ and ‘waterfront’ variables are categorical despite their numeric appearance. ‘Zipcode’ represents different regions, and ‘waterfront’ is a binary indicator, thus requiring special attention during preprocessing. These variables, along with others like ‘view’ and ‘condition’, will be transformed into dummy variables to facilitate their use in our regression models.
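As a brief sketch of what this transformation looks like (a toy data frame stands in for the full dataset), `factor()` marks a numeric-looking column as qualitative, and `model.matrix()` expands factors into 0/1 dummy columns:

```r
# Toy example: converting numeric-looking columns to factors and
# expanding them into dummy (indicator) variables.
df <- data.frame(zipcode    = c(98178, 98125, 98178),
                 waterfront = c(0, 1, 0))
df$zipcode    <- factor(df$zipcode)
df$waterfront <- factor(df$waterfront)

# One reference level per factor is dropped to avoid perfect collinearity
dummies <- model.matrix(~ zipcode + waterfront, data = df)
head(dummies)
```

In practice, `lm()` performs this expansion automatically for any column stored as a factor or character vector, which is why converting ‘zipcode’ to a character type later in this report is sufficient for the regression models.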

The data cleaning process has involved removing non-informative variables such as IDs, correcting data types (e.g., transforming sale price to a numeric format and parsing dates into a usable format), and creating new variables that could reveal temporal trends (e.g., year, month, day of sale).

# The "id" column must be removed
df_house = subset(df_house, select = -id)

# The data in the column "price" must be converted to numeric
df_house$price <- as.numeric(gsub("[\\$,]", "", df_house$price))

# Cleanup of "date" and creation of "year", "month", and "day" columns
df_house$date <- as.POSIXct(df_house$date, format = "%Y%m%d")
df_house$year <- as.numeric(format(df_house$date, "%Y"))
df_house$month <- as.character(format(df_house$date, "%m"))
df_house$day <- as.numeric(format(df_house$date, "%d"))

# Collect columns that are numeric only
ndf_house = df_house[sapply(df_house, is.numeric)]
# This also drops the non-numeric "date" column from the subset

head(df_house)

Categorical Variables and Age Analysis

Renovation Indicator Variable

A binary variable named ‘renovated’ was introduced to indicate whether a property has undergone renovation. This variable is set to 1 if the ‘yr_renovated’ field is not zero, signifying that the property has been renovated at least once. Otherwise, it is set to 0, indicating no renovation. This distinction provides a straightforward way to assess the impact of renovations on property values.

# Creating a new variable 'renovated' 
# 1 if the property has been renovated (yr_renovated != 0), 0 otherwise
df_house$renovated = ifelse(df_house$yr_renovated != 0, 1, 0)

Property Age Calculation

This is a tentative way to capture “age since last renovation” and to test whether the time since a house was last built or renovated correlates with its selling price. A new variable called ‘age’ was calculated to represent the current age of each property. If a property was renovated, its age is the difference between 2023 and the renovation year (‘yr_renovated’). If not renovated, the age is the difference between 2023 and the year the house was built (‘yr_built’). This variable helps in understanding the effect of property age and recent renovations on house prices.

# Creating 'age' column
df_house$age = ifelse(df_house$renovated == 1, 
                      2023 - df_house$yr_renovated, 
                      2023 - df_house$yr_built)

head(df_house)

Reinterpreting Zipcode as a Categorical Variable

Here we transform the ‘zipcode’ variable from numerical to categorical data. Zip codes are often mistakenly treated as numerical values in datasets; however, they are categorical by nature, representing distinct geographical areas rather than possessing any inherent numerical relationship. This conversion is crucial for our analysis, as it allows us to use the zip code as a qualitative variable in our models, acknowledging its role in distinguishing different regions and their unique characteristics in housing prices. The R code snippet below converts ‘zipcode’ to a character type, ensuring it is correctly utilized in subsequent analyses.

# Convert Zipcode to a Categorical variable (char) instead of number
df_house$zipcode = as.character(df_house$zipcode)

head(df_house)
# Keep a copy of the cleaned dataset (without "date") for later reference
df_house_original <- df_house
df_house_original[c("date")] <- list(NULL)

Enhancing Data Integrity: Addressing Redundancies and Correlations

In this pivotal section, we undertake a rigorous data cleaning process, focusing on eliminating unnecessary variables and managing highly correlated variables. This step is vital for enhancing the integrity and validity of our dataset, ensuring that our predictive models are based on the most relevant and independent variables. By methodically addressing these aspects, we aim to create a more streamlined and efficient dataset, thereby laying a solid foundation for robust and accurate predictive modeling in our real estate analysis.

Correlation Analysis

In the “Correlation Analysis” section, we delve into examining the relationships between various features of the dataset and the target variable, ‘price’. This involves creating a correlation matrix and corresponding plots to visually and statistically identify how different variables are related to each other and to house prices.

# Create a correlation Matrix
cor_matrix = cor(ndf_house)
correlation_df = as.data.frame(cor_matrix)

# New dataframe with variables 'highly' (>0.2) correlated with price
x_high = subset(ndf_house, 
                select = c(view, waterfront, sqft_basement, sqft_living, 
                           sqft_living15, sqft_above, grade, bathrooms, 
                           bedrooms, floors, lat))

# Create a Heatmap 
pheatmap(cor_matrix,
         color = colorRampPalette(c("#6baed6", "white", "#ff3366"))(20),
         main = "Correlation Matrix Heatmap",
         fontsize = 8,
         cellwidth = 15,
         cellheight = 11,
         display_numbers = TRUE
)

Analysis: Following the generation of the correlation plot, a thorough analysis is vital. For example, high correlations between ‘sqft_living’, ‘grade’, ‘sqft_above’, and ‘price’ indicate a strong linear relationship, suggesting that larger homes, built with a high-quality level of construction and design, and more above-ground square footage, tend to be more expensive. These variables could serve as key predictors in a pricing model. We will explore them further in the next model development process. However, it’s essential to note that correlation does not imply causation. Variables like ‘zipcode’, now treated as categorical, will not be represented in this correlation matrix, reminding us to consider geographical influences separately. Additionally, observing any potential multicollinearity between predictors is crucial, as it can affect the reliability of the regression model. Finally, interpreting these correlations within the context of the real estate market in King County provides insights into local housing trends and factors influencing house prices.

Unraveling High Correlation in Data Variables

This section focuses on identifying highly correlated variables within the King County house sales dataset. Through an R script, we systematically extract numeric variables and construct a correlation matrix, applying a threshold to spotlight significant correlations. We aim to reveal pairs of variables with a correlation coefficient exceeding 0.8, signifying strong linear relationships. This analysis is crucial in understanding interdependencies among variables, guiding us in avoiding multicollinearity in our predictive models and ensuring their statistical integrity and interpretability. (Kutner et al. 2005)

# Identifying highly correlated variables

# Retain only the numeric columns in the dataset
df_house_cor <- df_house[sapply(df_house, is.numeric)]

# use 'complete.obs' to handle missing values
corr_matrix <- cor(df_house_cor, use = "complete.obs")

threshold <- 0.8
high_corr <- which(abs(corr_matrix) > threshold, arr.ind = TRUE)
high_corr <- high_corr[high_corr[, 1] < high_corr[, 2], ]

# Find highly correlated predictors
for (pair in 1:nrow(high_corr)) {
    row <- high_corr[pair, "row"]
    col <- high_corr[pair, "col"]
    cat(names(df_house_cor)[row], "and", names(df_house_cor)[col], 
        "have a correlation of", corr_matrix[row, col], "\n")
}
## sqft_living and sqft_above have a correlation of 0.8765966 
## yr_renovated and renovated have a correlation of 0.9999685 
## yr_built and age have a correlation of -0.9099238

Streamlining the Dataset for Enhanced Analysis

This section of the analysis is dedicated to refining the dataset by removing variables that are redundant or unnecessary for our modeling objectives. Specifically, we drop ‘date’ (already decomposed into year, month, and day) and the columns ‘sqft_living’, ‘yr_renovated’, and ‘yr_built’, each of which is nearly collinear with a retained variable (‘sqft_above’, ‘renovated’, and ‘age’, respectively). This pruning is an essential step in data preparation, ensuring that our dataset is lean and focused, which helps in improving the efficiency and accuracy of our predictive models. By eliminating these variables, we aim to enhance the clarity and relevancy of our data analysis.

# Remove redundant, unnecessary columns from dataset. 
df_house[c("date","sqft_living", "yr_renovated", 
           "yr_built")] <- list(NULL)

Initial Statistical Data Summary

The dataset from King County includes 21,613 observations; after cleaning and feature engineering it contains 21 variables related to house sales. These include continuous data such as price, square footage, and lat/long coordinates; discrete counts such as bedrooms and floors; and binary or ordinal variables such as waterfront status, view, and condition, which require dummy coding for analysis. The price ranges from $75,000 to $7,700,000, with a mean of $540,088. Houses range from 0 to 33 bedrooms, reflecting diverse property types.

str(df_house)
## 'data.frame':    21613 obs. of  21 variables:
##  $ price        : num  221900 538000 180000 604000 510000 ...
##  $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
##  $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
##  $ zipcode      : chr  "98178" "98125" "98028" "98136" ...
##  $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
##  $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
##  $ year         : num  2014 2014 2015 2014 2015 ...
##  $ month        : chr  "10" "12" "02" "12" ...
##  $ day          : num  13 9 25 9 18 12 27 15 15 12 ...
##  $ renovated    : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ age          : num  68 32 90 58 36 22 28 60 63 20 ...
summary(df_house)
##      price            bedrooms        bathrooms        sqft_lot      
##  Min.   :  75000   Min.   : 0.000   Min.   :0.000   Min.   :    520  
##  1st Qu.: 321950   1st Qu.: 3.000   1st Qu.:1.750   1st Qu.:   5040  
##  Median : 450000   Median : 3.000   Median :2.250   Median :   7618  
##  Mean   : 540088   Mean   : 3.371   Mean   :2.115   Mean   :  15107  
##  3rd Qu.: 645000   3rd Qu.: 4.000   3rd Qu.:2.500   3rd Qu.:  10688  
##  Max.   :7700000   Max.   :33.000   Max.   :8.000   Max.   :1651359  
##      floors        waterfront            view          condition    
##  Min.   :1.000   Min.   :0.000000   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.000  
##  Median :1.500   Median :0.000000   Median :0.0000   Median :3.000  
##  Mean   :1.494   Mean   :0.007542   Mean   :0.2343   Mean   :3.409  
##  3rd Qu.:2.000   3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.000  
##  Max.   :3.500   Max.   :1.000000   Max.   :4.0000   Max.   :5.000  
##      grade          sqft_above   sqft_basement      zipcode         
##  Min.   : 1.000   Min.   : 290   Min.   :   0.0   Length:21613      
##  1st Qu.: 7.000   1st Qu.:1190   1st Qu.:   0.0   Class :character  
##  Median : 7.000   Median :1560   Median :   0.0   Mode  :character  
##  Mean   : 7.657   Mean   :1788   Mean   : 291.5                     
##  3rd Qu.: 8.000   3rd Qu.:2210   3rd Qu.: 560.0                     
##  Max.   :13.000   Max.   :9410   Max.   :4820.0                     
##       lat             long        sqft_living15    sqft_lot15    
##  Min.   :47.16   Min.   :-122.5   Min.   : 399   Min.   :   651  
##  1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490   1st Qu.:  5100  
##  Median :47.57   Median :-122.2   Median :1840   Median :  7620  
##  Mean   :47.56   Mean   :-122.2   Mean   :1987   Mean   : 12768  
##  3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360   3rd Qu.: 10083  
##  Max.   :47.78   Max.   :-121.3   Max.   :6210   Max.   :871200  
##       year         month                day          renovated      
##  Min.   :2014   Length:21613       Min.   : 1.00   Min.   :0.00000  
##  1st Qu.:2014   Class :character   1st Qu.: 8.00   1st Qu.:0.00000  
##  Median :2014   Mode  :character   Median :16.00   Median :0.00000  
##  Mean   :2014                      Mean   :15.69   Mean   :0.04229  
##  3rd Qu.:2015                      3rd Qu.:23.00   3rd Qu.:0.00000  
##  Max.   :2015                      Max.   :31.00   Max.   :1.00000  
##       age        
##  Min.   :  8.00  
##  1st Qu.: 24.00  
##  Median : 46.00  
##  Mean   : 49.61  
##  3rd Qu.: 69.00  
##  Max.   :123.00
# Stat summary of the price
summary(df_house$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   75000  321950  450000  540088  645000 7700000

Graphical Analysis

Exploratory Analysis through Pairwise Scatter Plots

This subsection is dedicated to exploring the relationships between house prices and other key variables using pairwise scatter plots. These visualizations aim to unearth correlations that could indicate influential factors in housing prices within King County. By examining these relationships, we can hypothesize which features may drive property values and warrant further investigation in our predictive modeling.

The pairwise scatter plot below focuses on the variables that the heatmap correlation matrix identified as most strongly correlated with price.

# Pairwise scatter plot with correlation coefficients
ggpairs(ndf_house, 
        columns = c("price", "sqft_living", "bedrooms", "bathrooms", "grade"), 
        title = "Pairwise Scatter Plots with House Price")

Analysis: The pairwise scatter plots reveal varying degrees of correlation between house prices and selected features. Notably, the square footage of living space (sqft_living) and construction grade (grade) show substantial positive correlations with price, indicating that larger, higher-quality homes tend to command higher prices. Conversely, the number of bedrooms shows a weaker correlation, suggesting that while size matters, the mere number of bedrooms is less predictive of price. These insights are critical for focusing our modeling efforts on the most impactful predictors. Also, notice that each variable’s distribution is shown on the diagonal, with price notably skewed towards lower values, indicating most homes are on the more affordable end of the spectrum with fewer high-priced outliers.

Distribution of Sales Prices

This subsection aims to analyze the distribution of sales prices across the dataset. The histogram provides visual insight into the range and frequency of house prices, highlighting the market’s tendencies.

# Plot a Histogram of house sales prices
ggplot(df_house, aes(x = price)) +
  geom_histogram(bins = 30, fill = "#6baed6", color = "black") +
  labs(title = "Histogram of House Prices", x = "Price", y = "Count") +
  theme_minimal()

Analysis: The histogram shows the distribution of house prices, revealing that most houses are in the lower price range with a significant decrease in the number of houses as the price increases, indicating a right-skewed distribution with relatively few high-priced houses. This skewness suggests that while the majority of the market consists of moderately priced homes, luxury properties significantly drive up the average price.

Variability of Housing Prices Across Bedroom Counts

This subsection will analyze the relationship between the number of bedrooms in a property and its market price. By employing a boxplot, we can visually discern the central tendency and dispersion of house prices within each bedroom category, revealing insights into how additional bedrooms could potentially affect property values.

# Boxplot for price by number of bedrooms
ggplot(df_house, aes(x = factor(bedrooms), y = price)) + 
  geom_boxplot(fill = "#6baed6", color = "black") +
  labs(title = "Boxplot of Prices by Number of Bedrooms", 
       x = "Number of Bedrooms", y = "Price") +
  theme_minimal()

Analysis: The provided boxplot depicts the distribution of house prices with respect to the number of bedrooms. It shows a general increase in median price as the number of bedrooms increases up to a certain point, after which the median price plateaus and then fluctuates. Notably, homes with an exceptionally high number of bedrooms (e.g., 11, 33) exhibit significant price variability and outlier presence. This suggests a more complex relationship where factors beyond mere bedroom count may influence the higher pricing tiers. The presence of outliers, especially in categories with fewer bedrooms, indicates exceptions to general trends, possibly due to location, property condition, or other value-adding features.

Price Variability by Construction Grade

This subsection explores the relationship between house prices and construction grades. Construction grade is an indicator of the quality and design of a building, which can significantly impact its market value. The boxplot visualizes the distribution of house prices within each grade category, revealing how price variability and central tendencies change with the grade.

# Boxplot for price by construction grade
ggplot(df_house, aes(x = factor(grade), y = price)) + 
  geom_boxplot(fill = "#6baed6", color = "black") +
  labs(title = "Boxplot of Prices by Construction Grade", 
       x = "Grade", y = "Price") +
  theme_minimal()

Analysis: The boxplot of house prices by construction grade shows that higher-grade homes tend to have a higher median price, indicating a positive correlation between construction quality and house value. Notably, the spread and range of prices increase with the grade, suggesting greater diversity in the higher-end market. Outliers are present across all grades, but are especially pronounced at higher grades, which may indicate a niche market for luxury homes with exceptional features not captured by the grade alone.

Average Price by Number of Bedrooms

In this subsection, the analysis aims to uncover how the number of bedrooms influences the average house price. This investigation will reveal market trends and preferences, offering insights into the most sought-after property types.

# Average Price analysis
average_prices <- aggregate(price ~ bedrooms, data = ndf_house, FUN = mean)
ggplot(average_prices, aes(x = factor(bedrooms), y = price)) +
  geom_bar(stat = "identity", fill = "#6baed6", color = "black") +
  labs(title = "Average Price by Number of Bedrooms",
       x = "Number of Bedrooms",
       y = "Average Price") +
  theme_minimal()

Analysis: The bar chart depicting “Average Price by Number of Bedrooms” reveals a nuanced view of the housing market. Unlike the variability presented in the “Boxplot of Prices by Number of Bedrooms,” which shows a broad price range across different bedroom counts, the bar chart focuses on average values, smoothing out extreme data points. It highlights a peak in average prices for mid-range bedroom counts, suggesting a higher market value for these properties. This pattern may reflect a balance between affordability and space that appeals to a wider demographic, contrasting with the outliers seen in the boxplot. There is a notable outlier at 33 bedrooms with a relatively low average price, which could indicate an atypical property or data error.

House Prices by Waterfront Presence

Here the analysis examines the impact of a waterfront location on house prices. This visual comparison aims to highlight any premium attached to waterfront properties, which is a common factor in real estate valuation.

# Boxplot of house prices by waterfront presence
ggplot(df_house, aes(x = factor(waterfront), y = price)) +
  geom_boxplot(fill = "#6baed6", color = "black") +
  labs(title = "Boxplot of House Prices by Waterfront", 
       x = "Waterfront", y = "Price") +
  theme_minimal()

Analysis: The boxplot illustrates a significant difference in the distribution of house prices based on the presence of a waterfront. Non-waterfront properties (denoted by 0) show a dense clustering of prices with fewer outliers, suggesting a more uniform pricing structure within this category. On the other hand, waterfront properties (denoted by 1) exhibit a higher median price and a much wider spread, indicating a greater variance in how much buyers are willing to pay for this luxury feature. The presence of outliers in the waterfront category also suggests that certain premium properties command exceptional prices, which are not as prevalent in non-waterfront real estate. This analysis underscores the value added by waterfront locations, which can significantly increase property values and attract a niche market willing to invest in these desirable features.

Mapping Price Hotspots: Geospatial Analysis of King County’s Real Estate

This geospatial visualization maps the distribution of house prices across King County. By plotting price data against longitude and latitude, we aim to identify regional price trends and potential hotspots of high-value properties. This analysis provides insights into how location within the county correlates with housing prices, supporting a comprehensive understanding of the real estate market dynamics.

# Scatter plot of properties with mid-point transition color
ggplot(data = df_house, aes(x = long, y = lat, color = price)) +
  geom_point(alpha = 0.1) +  # alpha must be in [0, 1]; low alpha reveals density
  scale_color_gradient(low = "#6baed6", high = "red") +
  labs(title = "Geographical Distribution of House Prices in King County",
       x = "Longitude", y = "Latitude", color = "Price") +
  theme_minimal()

Analysis: The geographical scatter plot reveals a concentration of higher-priced properties (indicated by warmer colors) along specific geographic corridors, most notably around latitude 47.6 and longitudes between -122.25 and -122.00, corresponding to the central and northern parts of Seattle, suggesting that certain areas command premium prices. Notably, waterfront locations and urban centers show elevated price levels. Lower-priced properties, shown in colder colors, are more dispersed and located primarily south of central Seattle, extending towards Tacoma, as well as in the outlying suburban areas. Along the latitudinal line around 47.4 there are also pockets of high-priced properties, potentially indicating affluent neighborhoods or areas with high-value real estate. This spatial pattern underscores the importance of location in property valuation and can guide potential investments and urban development strategies. The clustering of the highest values in central urban areas is typical of markets where proximity to amenities, employment opportunities, and other socioeconomic factors drives up real estate prices. The plot also highlights outliers, which could be subject to further investigation to understand the factors driving their exceptional market value.

Summary: Comprehensive Data Assessment and Visual Exploration

In Section II, we thoroughly explored King County’s house sales dataset, examining both continuous and categorical variables through statistical summaries and extensive graphical analysis, work that is vital for predictive modeling accuracy. We highlighted the strong correlations between variables such as ‘sqft_living’, ‘grade’, ‘sqft_above’, and ‘price’. The section progressed through statistical examinations, categorical conversions, and new variable formulations, all while maintaining a critical eye on data integrity. Graphical analyses, from scatter plots to geospatial mappings, elucidated underlying trends and provided nuanced insights into the factors influencing house prices. This foundational work sets the stage for the advanced predictive modeling in subsequent sections, poised to capture the essence of the county’s real estate dynamics.

III. Model Development Process

Data Partitioning

In this crucial phase of the model development process, we divide our dataset into separate subsets for training and validation. Establishing a 70-30 split, we allocate the majority for model training, ensuring ample data for learning, while reserving 30% as a held-out validation set, which will serve as a litmus test for our model’s predictive power on unseen data.

# Reproducible 70/30 train-validation split
set.seed(1023)
n <- dim(df_house)[1]
IND <- sample(1:n, round(n * 0.7))
train.dat <- df_house[IND, ]
test.dat <- df_house[-IND, ]

dim(train.dat)
## [1] 15129    21
dim(test.dat)
## [1] 6484   21

Analysis: The partitioning results in 15,129 cases for model training, providing a comprehensive learning set that encompasses a wide spectrum of the data’s variability. The test set, with 6,484 cases, is sufficiently large to evaluate model performance and guard against overfitting. This balanced approach to data division equips us with the necessary framework to rigorously train and objectively assess our predictive model, setting a strong foundation for robust statistical analysis.
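The split arithmetic can be sanity-checked without the raw data; a minimal sketch, using only the total row count of 21,613 implied by the two dim() calls above:

```r
# Verify the 70/30 partition arithmetic from the reported row counts alone.
set.seed(1023)                # same seed as the partition above
n <- 15129 + 6484             # full dataset's row count
IND <- sample(1:n, round(n * 0.7))

length(IND)                   # training rows: 15129
n - length(IND)               # test rows: 6484
```

Because `sample()` draws exactly `round(n * 0.7)` indices, the training set size is deterministic regardless of the seed; the seed only controls which rows land where.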

Model Fitting and Initial Evaluation

In this step, we meticulously construct a linear regression model, leveraging the training dataset to ascertain how well our predictors explain the variability in house prices. The initial model fitting is a critical juncture where we assess the significance of each predictor and the overall strength of the model. (Faraway 2014)

# Fit a linear regression model based on the training dataset
house_lm<-lm(price ~.,data=train.dat)
summary(house_lm)
## 
## Call:
## lm(formula = price ~ ., data = train.dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1143447   -69381     -322    61678  4433247 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.695e+08  1.938e+07  -8.750  < 2e-16 ***
## bedrooms      -2.467e+04  1.783e+03 -13.831  < 2e-16 ***
## bathrooms      2.161e+04  3.087e+03   7.001 2.65e-12 ***
## sqft_lot       2.614e-01  4.812e-02   5.433 5.64e-08 ***
## floors        -4.133e+04  3.765e+03 -10.977  < 2e-16 ***
## waterfront     6.172e+05  1.754e+04  35.192  < 2e-16 ***
## view           5.770e+04  2.100e+03  27.476  < 2e-16 ***
## condition      2.820e+04  2.289e+03  12.320  < 2e-16 ***
## grade          5.727e+04  2.154e+03  26.584  < 2e-16 ***
## sqft_above     2.022e+02  3.605e+00  56.081  < 2e-16 ***
## sqft_basement  1.299e+02  4.195e+00  30.973  < 2e-16 ***
## zipcode98002   3.533e+04  1.678e+04   2.106 0.035203 *  
## zipcode98003  -2.327e+04  1.534e+04  -1.517 0.129166    
## zipcode98004   7.361e+05  2.783e+04  26.451  < 2e-16 ***
## zipcode98005   2.605e+05  2.958e+04   8.806  < 2e-16 ***
## zipcode98006   2.496e+05  2.428e+04  10.281  < 2e-16 ***
## zipcode98007   2.192e+05  3.059e+04   7.166 8.08e-13 ***
## zipcode98008   2.241e+05  2.918e+04   7.679 1.71e-14 ***
## zipcode98010   1.103e+05  2.593e+04   4.256 2.10e-05 ***
## zipcode98011   6.727e+04  3.784e+04   1.778 0.075454 .  
## zipcode98014   1.381e+05  4.190e+04   3.296 0.000982 ***
## zipcode98019   8.090e+04  4.095e+04   1.976 0.048197 *  
## zipcode98022   4.778e+04  2.281e+04   2.095 0.036171 *  
## zipcode98023  -4.734e+04  1.399e+04  -3.383 0.000718 ***
## zipcode98024   1.598e+05  3.681e+04   4.342 1.42e-05 ***
## zipcode98027   1.801e+05  2.492e+04   7.227 5.17e-13 ***
## zipcode98028   5.534e+04  3.692e+04   1.499 0.133871    
## zipcode98029   2.188e+05  2.845e+04   7.689 1.58e-14 ***
## zipcode98030   8.768e+03  1.669e+04   0.525 0.599424    
## zipcode98031   9.586e+03  1.731e+04   0.554 0.579822    
## zipcode98032  -1.440e+04  2.046e+04  -0.704 0.481657    
## zipcode98033   3.180e+05  3.171e+04  10.029  < 2e-16 ***
## zipcode98034   1.558e+05  3.388e+04   4.597 4.32e-06 ***
## zipcode98038   6.101e+04  1.876e+04   3.251 0.001151 ** 
## zipcode98039   1.085e+06  3.784e+04  28.682  < 2e-16 ***
## zipcode98040   4.769e+05  2.457e+04  19.409  < 2e-16 ***
## zipcode98042   1.986e+04  1.592e+04   1.248 0.212168    
## zipcode98045   1.567e+05  3.485e+04   4.498 6.91e-06 ***
## zipcode98052   1.987e+05  3.219e+04   6.173 6.86e-10 ***
## zipcode98053   1.760e+05  3.452e+04   5.097 3.49e-07 ***
## zipcode98055   4.062e+04  1.947e+04   2.086 0.036973 *  
## zipcode98056   8.437e+04  2.106e+04   4.006 6.21e-05 ***
## zipcode98058   2.590e+04  1.843e+04   1.405 0.159941    
## zipcode98059   7.371e+04  2.066e+04   3.568 0.000361 ***
## zipcode98065   1.201e+05  3.210e+04   3.740 0.000185 ***
## zipcode98070  -6.608e+04  2.441e+04  -2.707 0.006796 ** 
## zipcode98072   1.057e+05  3.779e+04   2.798 0.005149 ** 
## zipcode98074   1.660e+05  3.047e+04   5.446 5.22e-08 ***
## zipcode98075   1.647e+05  2.933e+04   5.616 1.98e-08 ***
## zipcode98077   8.469e+04  3.936e+04   2.152 0.031423 *  
## zipcode98092  -2.563e+04  1.499e+04  -1.710 0.087317 .  
## zipcode98102   4.802e+05  3.173e+04  15.135  < 2e-16 ***
## zipcode98103   2.789e+05  3.056e+04   9.127  < 2e-16 ***
## zipcode98105   4.236e+05  3.138e+04  13.501  < 2e-16 ***
## zipcode98106   1.000e+05  2.235e+04   4.476 7.66e-06 ***
## zipcode98107   2.845e+05  3.142e+04   9.056  < 2e-16 ***
## zipcode98108   9.989e+04  2.466e+04   4.050 5.15e-05 ***
## zipcode98109   4.503e+05  3.214e+04  14.011  < 2e-16 ***
## zipcode98112   5.978e+05  2.874e+04  20.799  < 2e-16 ***
## zipcode98115   2.722e+05  3.102e+04   8.776  < 2e-16 ***
## zipcode98116   2.317e+05  2.502e+04   9.264  < 2e-16 ***
## zipcode98117   2.531e+05  3.139e+04   8.065 7.90e-16 ***
## zipcode98118   1.418e+05  2.199e+04   6.451 1.15e-10 ***
## zipcode98119   4.275e+05  3.026e+04  14.125  < 2e-16 ***
## zipcode98122   2.803e+05  2.715e+04  10.324  < 2e-16 ***
## zipcode98125   1.424e+05  3.351e+04   4.250 2.15e-05 ***
## zipcode98126   1.514e+05  2.312e+04   6.550 5.93e-11 ***
## zipcode98133   1.032e+05  3.463e+04   2.979 0.002896 ** 
## zipcode98136   1.968e+05  2.376e+04   8.285  < 2e-16 ***
## zipcode98144   2.390e+05  2.527e+04   9.458  < 2e-16 ***
## zipcode98146   6.581e+04  2.128e+04   3.093 0.001983 ** 
## zipcode98148   4.226e+04  2.694e+04   1.569 0.116647    
## zipcode98155   8.102e+04  3.602e+04   2.249 0.024529 *  
## zipcode98166   1.976e+04  1.908e+04   1.036 0.300391    
## zipcode98168   5.061e+04  2.050e+04   2.469 0.013559 *  
## zipcode98177   1.439e+05  3.622e+04   3.974 7.10e-05 ***
## zipcode98178   3.099e+04  2.119e+04   1.463 0.143589    
## zipcode98188   1.236e+04  2.200e+04   0.562 0.574243    
## zipcode98198  -1.724e+04  1.669e+04  -1.033 0.301541    
## zipcode98199   3.029e+05  2.973e+04  10.191  < 2e-16 ***
## lat            1.429e+05  7.531e+04   1.898 0.057706 .  
## long          -1.654e+05  5.396e+04  -3.065 0.002184 ** 
## sqft_living15  1.631e+01  3.451e+00   4.726 2.32e-06 ***
## sqft_lot15    -8.556e-02  7.142e-02  -1.198 0.230925    
## year           7.046e+04  8.931e+03   7.889 3.26e-15 ***
## month02        3.832e+03  8.136e+03   0.471 0.637642    
## month03        2.945e+04  7.477e+03   3.939 8.23e-05 ***
## month04        3.251e+04  7.294e+03   4.457 8.35e-06 ***
## month05        5.571e+04  9.785e+03   5.693 1.27e-08 ***
## month06        6.683e+04  1.155e+04   5.786 7.34e-09 ***
## month07        6.152e+04  1.154e+04   5.333 9.80e-08 ***
## month08        6.255e+04  1.165e+04   5.369 8.02e-08 ***
## month09        5.910e+04  1.174e+04   5.036 4.81e-07 ***
## month10        6.246e+04  1.168e+04   5.348 9.01e-08 ***
## month11        6.098e+04  1.202e+04   5.073 3.95e-07 ***
## month12        6.167e+04  1.196e+04   5.157 2.54e-07 ***
## day           -1.079e+02  1.541e+02  -0.700 0.483967    
## renovated      8.124e+04  6.883e+03  11.804  < 2e-16 ***
## age            6.456e+02  7.723e+01   8.359  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 159300 on 15030 degrees of freedom
## Multiple R-squared:  0.8073, Adjusted R-squared:  0.806 
## F-statistic: 642.5 on 98 and 15030 DF,  p-value: < 2.2e-16
# Calculate the Mean Squared Error (the squared residual standard error)
MSE <- summary(house_lm)$sigma^2
print(MSE)
## [1] 25372645854

Analysis: The preliminary output indicates a robust model with a multiple R-squared of 80.73% (adjusted 80.60%), signifying that roughly 80% of the variability in house prices is explained by the predictors in the model. The significant p-values across most variables confirm their relevance in predicting house prices. However, the substantial residual standard error suggests room for improvement in model accuracy. The residual diagnostics will need to be carefully examined to identify potential violations of model assumptions and to strategize on improvements.

The house_lm model to predict price has: - a residual standard error of 159,300 on 15,030 degrees of freedom - an adjusted R-squared of 80.60% - an F-statistic p-value practically at 0 - a mean squared error of 25,372,645,854

Overall, the model is statistically significant given the near-zero p-value for the F-statistic. Most predictors have a statistically significant impact on house price, with exceptions such as ‘sqft_lot15’, ‘day’, and several zipcode and month levels. Predictions on the test dataset should be made to further validate the model’s usefulness.

Assumptions Testing

This section is devoted to examining the assumptions inherent in linear regression analysis. By evaluating the residuals from the fitted model, we can assess whether the assumptions of linearity, normality, homoscedasticity, and the absence of influential outliers are met. These diagnostics are crucial to ensure the validity of the model’s inferences.

# Plotting the Regular Model
par(mfrow=c(2,2))
plot(house_lm)

Analysis: The diagnostic plots indicate potential issues with the regression model:

- LINEARITY ASSUMPTION: The Residuals vs Fitted plot shows a non-random, cone-shaped spread around the (mostly horizontal) zero line, suggesting the linearity assumption may be violated.
- NORMALITY ASSUMPTION: The Q-Q plot of residuals reveals deviations from normality, particularly in the tails; the residuals do not meet the normality assumption.
- CONSTANT VARIANCE ASSUMPTION: The Scale-Location plot shows a curvilinear trend, indicating heteroscedasticity; the variance of the residuals is not constant, so the homoscedasticity assumption is violated.
- INFLUENTIAL POINTS: The Residuals vs Leverage plot flags several potential outliers and influential points that could unduly affect the model’s predictions.

These observations suggest that the model may benefit from transformations or robust regression techniques to meet the necessary assumptions.
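One such remedy is to model the logarithm of price, which often stabilizes the variance of right-skewed price data. The sketch below is illustrative only, using synthetic data in place of train.dat and test.dat; the back-transformation step would apply equally to the real model:

```r
# Synthetic stand-in for the housing data: multiplicative (lognormal) errors,
# exactly the situation where a log transform of the response helps.
set.seed(1)
n <- 500
sqft  <- runif(n, 500, 4000)
price <- exp(11 + 4e-4 * sqft + rnorm(n, sd = 0.3))
dat   <- data.frame(price, sqft)
train <- dat[1:350, ]
test  <- dat[351:500, ]

fit_raw <- lm(price ~ sqft, data = train)
fit_log <- lm(log(price) ~ sqft, data = train)

# Back-transform predictions: exp() alone is biased low, so apply a
# Duan-style smearing correction (mean of the exponentiated residuals).
smear    <- mean(exp(residuals(fit_log)))
pred_log <- exp(predict(fit_log, newdata = test)) * smear

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(test$price, predict(fit_raw, newdata = test))  # raw-scale model
rmse(test$price, pred_log)                          # back-transformed log model
```

With multiplicative errors, the log-scale fit also tends to produce residual plots that are far closer to satisfying the linearity, normality, and constant-variance assumptions examined above.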

Variance Inflation Factor (VIF) Analysis

The variance inflation factor (VIF) analysis assesses multicollinearity among the predictors in the regression model. VIF quantifies how much the variance of an estimated regression coefficient is inflated because that predictor is correlated with the others. A VIF above 5 is commonly taken to signal problematic multicollinearity.

# Variance Inflation Factor
vif(house_lm)
##                      GVIF Df GVIF^(1/(2*Df))
## bedrooms         1.674600  1        1.294063
## bathrooms        3.369428  1        1.835600
## sqft_lot         2.102902  1        1.450139
## floors           2.460427  1        1.568575
## waterfront       1.239958  1        1.113534
## view             1.503010  1        1.225973
## condition        1.335554  1        1.155662
## grade            3.805459  1        1.950759
## sqft_above       5.255421  1        2.292470
## sqft_basement    2.063746  1        1.436574
## zipcode       7720.300239 69        1.067017
## lat             64.589892  1        8.036784
## long            34.622621  1        5.884099
## sqft_living15    3.364617  1        1.834289
## sqft_lot15       2.236179  1        1.495386
## year            10.408154  1        3.226167
## month           11.233221 11        1.116221
## day              1.057673  1        1.028432
## renovated        1.152842  1        1.073705
## age              2.963422  1        1.721459

Analysis: The VIF statistics indicate that ‘sqft_above’ (VIF ≈ 5.3), ‘lat’, and ‘long’ display notable multicollinearity, suggesting in particular that the geographical variables are interrelated. ‘zipcode’ shows an extremely high raw GVIF, though this is largely an artifact of its 69 dummy variables; its scaled GVIF^(1/(2·Df)) of about 1.07 is unremarkable. These results suggest a need to reduce multicollinearity, possibly by combining related variables, removing some predictors, or applying regularization techniques, to improve the model’s predictive performance and interpretability.
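The table’s entries can be reproduced from first principles: the VIF of predictor j is 1/(1 − R_j²), where R_j² comes from regressing j on the remaining predictors. A minimal sketch on synthetic data (variable names are illustrative):

```r
# Two deliberately collinear predictors and one independent one.
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.5)   # strongly correlated with x1
x3 <- rnorm(n)

# Manual VIF for x1: regress it on the other predictors.
r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1    # well above 1, reflecting the x1/x2 collinearity
```

For a model containing x1, x2, and x3, `car::vif()` (the function used above) would report this same value for x1; for multi-level factors such as ‘zipcode’ it generalizes this idea to the GVIF shown in the table.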

Refinement of Predictive Model Post-VIF Analysis

The process of refining a predictive model often involves the elimination of variables that cause multicollinearity. This subsection illustrates how the identification of high VIF scores led to the removal of the ‘lat’ and ‘long’ variables, which are likely to be interdependent, and the subsequent reevaluation of the model.

Removing ‘zipcode’ would actually reduce the model’s explanatory power from about 80% to about 65%, so we keep it as is and will work on transformations later on.

# Remove redundant, unnecessary columns from dataset. 
df_house[c("lat", "long")] <- list(NULL)
train.dat[c("lat", "long")] <- list(NULL)
test.dat[c("lat", "long")] <- list(NULL)

# Re-create the same 70/30 split on the reduced dataset
set.seed(1023)
n <- dim(df_house)[1]
IND <- sample(1:n, round(n * 0.7))
train.dat <- df_house[IND, ]
test.dat  <- df_house[-IND, ]

dim(train.dat)
## [1] 15129    19
dim(test.dat)
## [1] 6484   19
# Refitting the Model after removing Latitude and Longitude
house_lm<-lm(price ~.,data=train.dat)
summary(house_lm)
## 
## Call:
## lm(formula = price ~ ., data = train.dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1139159   -69641     -328    62052  4434229 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.429e+08  1.800e+07  -7.936 2.24e-15 ***
## bedrooms      -2.464e+04  1.784e+03 -13.810  < 2e-16 ***
## bathrooms      2.166e+04  3.088e+03   7.012 2.45e-12 ***
## sqft_lot       2.561e-01  4.811e-02   5.323 1.04e-07 ***
## floors        -4.126e+04  3.765e+03 -10.960  < 2e-16 ***
## waterfront     6.169e+05  1.754e+04  35.181  < 2e-16 ***
## view           5.774e+04  2.100e+03  27.490  < 2e-16 ***
## condition      2.808e+04  2.289e+03  12.267  < 2e-16 ***
## grade          5.753e+04  2.154e+03  26.707  < 2e-16 ***
## sqft_above     2.018e+02  3.605e+00  55.977  < 2e-16 ***
## sqft_basement  1.298e+02  4.197e+00  30.938  < 2e-16 ***
## zipcode98002   2.559e+04  1.648e+04   1.553 0.120558    
## zipcode98003  -1.587e+04  1.519e+04  -1.045 0.296127    
## zipcode98004   7.687e+05  1.478e+04  51.998  < 2e-16 ***
## zipcode98005   2.865e+05  1.759e+04  16.288  < 2e-16 ***
## zipcode98006   2.642e+05  1.335e+04  19.784  < 2e-16 ***
## zipcode98007   2.404e+05  1.891e+04  12.715  < 2e-16 ***
## zipcode98008   2.416e+05  1.516e+04  15.938  < 2e-16 ***
## zipcode98010   6.927e+04  2.142e+04   3.234 0.001223 ** 
## zipcode98011   1.191e+05  1.702e+04   6.996 2.74e-12 ***
## zipcode98014   1.207e+05  2.045e+04   5.905 3.61e-09 ***
## zipcode98019   9.060e+04  1.677e+04   5.402 6.70e-08 ***
## zipcode98022  -1.174e+04  1.618e+04  -0.726 0.467936    
## zipcode98023  -3.223e+04  1.307e+04  -2.466 0.013686 *  
## zipcode98024   1.365e+05  2.402e+04   5.682 1.36e-08 ***
## zipcode98027   1.731e+05  1.366e+04  12.677  < 2e-16 ***
## zipcode98028   1.147e+05  1.534e+04   7.479 7.93e-14 ***
## zipcode98029   2.105e+05  1.450e+04  14.521  < 2e-16 ***
## zipcode98030   4.015e+03  1.542e+04   0.260 0.794535    
## zipcode98031   9.392e+03  1.504e+04   0.624 0.532368    
## zipcode98032  -4.147e+03  1.990e+04  -0.208 0.834923    
## zipcode98033   3.570e+05  1.360e+04  26.252  < 2e-16 ***
## zipcode98034   2.037e+05  1.284e+04  15.861  < 2e-16 ***
## zipcode98038   3.023e+04  1.253e+04   2.414 0.015798 *  
## zipcode98039   1.124e+06  2.929e+04  38.380  < 2e-16 ***
## zipcode98040   5.047e+05  1.541e+04  32.741  < 2e-16 ***
## zipcode98042   2.372e+03  1.274e+04   0.186 0.852291    
## zipcode98045   9.570e+04  1.615e+04   5.925 3.18e-09 ***
## zipcode98052   2.264e+05  1.258e+04  17.992  < 2e-16 ***
## zipcode98053   1.879e+05  1.396e+04  13.459  < 2e-16 ***
## zipcode98055   5.093e+04  1.516e+04   3.359 0.000783 ***
## zipcode98056   9.827e+04  1.353e+04   7.262 4.00e-13 ***
## zipcode98058   2.458e+04  1.332e+04   1.845 0.064990 .  
## zipcode98059   7.785e+04  1.319e+04   5.904 3.63e-09 ***
## zipcode98065   8.422e+04  1.458e+04   5.775 7.84e-09 ***
## zipcode98070  -1.785e+04  2.068e+04  -0.863 0.387968    
## zipcode98072   1.459e+05  1.537e+04   9.490  < 2e-16 ***
## zipcode98074   1.733e+05  1.346e+04  12.874  < 2e-16 ***
## zipcode98075   1.632e+05  1.424e+04  11.462  < 2e-16 ***
## zipcode98077   1.126e+05  1.732e+04   6.504 8.09e-11 ***
## zipcode98092  -4.339e+04  1.398e+04  -3.102 0.001923 ** 
## zipcode98102   5.344e+05  2.034e+04  26.271  < 2e-16 ***
## zipcode98103   3.430e+05  1.299e+04  26.407  < 2e-16 ***
## zipcode98105   4.777e+05  1.643e+04  29.081  < 2e-16 ***
## zipcode98106   1.461e+05  1.422e+04  10.273  < 2e-16 ***
## zipcode98107   3.530e+05  1.559e+04  22.648  < 2e-16 ***
## zipcode98108   1.399e+05  1.680e+04   8.329  < 2e-16 ***
## zipcode98109   5.091e+05  2.079e+04  24.483  < 2e-16 ***
## zipcode98112   6.472e+05  1.587e+04  40.784  < 2e-16 ***
## zipcode98115   3.306e+05  1.296e+04  25.515  < 2e-16 ***
## zipcode98116   2.893e+05  1.442e+04  20.068  < 2e-16 ***
## zipcode98117   3.243e+05  1.307e+04  24.814  < 2e-16 ***
## zipcode98118   1.757e+05  1.317e+04  13.338  < 2e-16 ***
## zipcode98119   4.899e+05  1.704e+04  28.745  < 2e-16 ***
## zipcode98122   3.274e+05  1.517e+04  21.586  < 2e-16 ***
## zipcode98125   2.052e+05  1.375e+04  14.929  < 2e-16 ***
## zipcode98126   2.018e+05  1.441e+04  14.000  < 2e-16 ***
## zipcode98133   1.759e+05  1.311e+04  13.416  < 2e-16 ***
## zipcode98136   2.485e+05  1.565e+04  15.883  < 2e-16 ***
## zipcode98144   2.826e+05  1.437e+04  19.664  < 2e-16 ***
## zipcode98146   1.077e+05  1.520e+04   7.087 1.43e-12 ***
## zipcode98148   7.008e+04  2.505e+04   2.797 0.005159 ** 
## zipcode98155   1.499e+05  1.338e+04  11.203  < 2e-16 ***
## zipcode98166   5.337e+04  1.528e+04   3.493 0.000479 ***
## zipcode98168   8.194e+04  1.546e+04   5.300 1.17e-07 ***
## zipcode98177   2.219e+05  1.570e+04  14.132  < 2e-16 ***
## zipcode98178   5.395e+04  1.553e+04   3.474 0.000513 ***
## zipcode98188   3.333e+04  1.947e+04   1.712 0.086930 .  
## zipcode98198   1.814e+03  1.539e+04   0.118 0.906177    
## zipcode98199   3.717e+05  1.470e+04  25.291  < 2e-16 ***
## sqft_living15  1.645e+01  3.452e+00   4.766 1.90e-06 ***
## sqft_lot15    -9.721e-02  7.137e-02  -1.362 0.173179    
## year           7.062e+04  8.935e+03   7.904 2.88e-15 ***
## month02        4.026e+03  8.139e+03   0.495 0.620881    
## month03        2.951e+04  7.480e+03   3.945 8.01e-05 ***
## month04        3.238e+04  7.297e+03   4.438 9.16e-06 ***
## month05        5.585e+04  9.789e+03   5.706 1.18e-08 ***
## month06        6.689e+04  1.156e+04   5.789 7.23e-09 ***
## month07        6.171e+04  1.154e+04   5.347 9.06e-08 ***
## month08        6.267e+04  1.165e+04   5.377 7.68e-08 ***
## month09        5.937e+04  1.174e+04   5.058 4.29e-07 ***
## month10        6.246e+04  1.168e+04   5.346 9.11e-08 ***
## month11        6.121e+04  1.202e+04   5.091 3.60e-07 ***
## month12        6.190e+04  1.196e+04   5.175 2.31e-07 ***
## day           -1.030e+02  1.542e+02  -0.668 0.504236    
## renovated      8.147e+04  6.885e+03  11.833  < 2e-16 ***
## age            6.530e+02  7.722e+01   8.456  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 159400 on 15032 degrees of freedom
## Multiple R-squared:  0.8071, Adjusted R-squared:  0.8059 
## F-statistic: 655.2 on 96 and 15032 DF,  p-value: < 2.2e-16
# Variance Inflation Factor of the adjusted model
vif(house_lm)
##                    GVIF Df GVIF^(1/(2*Df))
## bedrooms       1.674534  1        1.294038
## bathrooms      3.369004  1        1.835485
## sqft_lot       2.100313  1        1.449246
## floors         2.458115  1        1.567838
## waterfront     1.238838  1        1.113031
## view           1.502346  1        1.225702
## condition      1.334545  1        1.155225
## grade          3.801438  1        1.949728
## sqft_above     5.250120  1        2.291314
## sqft_basement  2.063651  1        1.436541
## zipcode        5.900596 69        1.012946
## sqft_living15  3.364159  1        1.834164
## sqft_lot15     2.231436  1        1.493799
## year          10.407812  1        3.226114
## month         11.223241 11        1.116176
## day            1.057586  1        1.028390
## renovated      1.152723  1        1.073649
## age            2.960382  1        1.720576

Analysis: After removing the ‘lat’ and ‘long’ variables to address multicollinearity, the re-fitted model’s R-squared is essentially unchanged at 80.71%, indicating that these variables added little explanatory power beyond what ‘zipcode’ already captures. The residual standard error remains around 159,400, and the F-statistic p-value is still below 2.2e-16, so the model remains statistically significant. The GVIF values post-refinement show that while some multicollinearity is still present (notably in ‘year’ and ‘month’), its severity has decreased, pointing towards a more stable and interpretable model.

Evaluating Model Efficacy with Test Data

This subsection assesses the linear regression model’s performance on the test data. By applying the model to unseen data, we measure its predictive accuracy and generalizability beyond the training dataset.

# Test data Predictions
house_lm_test_pred <- predict(house_lm, newdata = test.dat)

house_lm_test_mse <- mean((house_lm_test_pred - test.dat$price)^2)
house_lm_test_rmse <- sqrt(house_lm_test_mse)
house_lm_test_residuals <- test.dat$price - house_lm_test_pred
house_lm_test_rsq <- 1 - var(house_lm_test_residuals) / var(test.dat$price)
house_lm_test_sse <- sum((test.dat$price - house_lm_test_pred)^2)

# Add the predictions to the results
results.df <- data.frame(model = "Linear Regression Test Data Predictions",
                         R.Squared.Train = summary(house_lm)$r.square,
                         R.Squared.Test = house_lm_test_rsq,
                         RMSE.test = house_lm_test_rmse,
                         SSE.test = house_lm_test_sse)

print(results.df)
##                                     model R.Squared.Train R.Squared.Test
## 1 Linear Regression Test Data Predictions       0.8071138      0.8125711
##   RMSE.test    SSE.test
## 1  164316.8 1.75068e+14

Analysis: The model demonstrates robust predictive performance with an R-squared of 0.8126 on the test data, indicating that about 81.26% of the variability in the test price data is explained by the model. This slightly exceeds the training R-squared of 80.71%, suggesting good model generalization. The RMSE of 164,316.8 quantifies the average deviation between predicted and actual prices, and the SSE of 1.75068e+14 represents the total squared deviation; both are used to gauge the model’s accuracy and precision on new data.

Enhancing Model Accuracy by Dropping Insignificant Predictors

This segment of the analysis concentrates on refining the model by discarding predictors that do not contribute significantly to the prediction of house prices. This is achieved by examining the p-values of the predictors and eliminating those that surpass a chosen significance level, thereby simplifying the model without notably affecting its explanatory power.

The columns ‘sqft_lot15’ and ‘day’ are dropped for having p-values > 0.05. ‘year’ and ‘month’ are retained for now despite their overlap with the derived ‘age’ column, which uses the year built or renovated to gauge how modern the construction is.

# Remove predictors from datasets
train.dat <- subset(train.dat, select = -c(sqft_lot15, day))
test.dat <- subset(test.dat, select = -c(sqft_lot15, day))

# Refit the model with updated datasets
house_lm <- lm(price ~ ., data = train.dat)
summary(house_lm)
## 
## Call:
## lm(formula = price ~ ., data = train.dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1137515   -69444     -293    62073  4436770 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.446e+08  1.771e+07  -8.164 3.50e-16 ***
## bedrooms      -2.456e+04  1.783e+03 -13.771  < 2e-16 ***
## bathrooms      2.173e+04  3.088e+03   7.037 2.05e-12 ***
## sqft_lot       2.141e-01  3.689e-02   5.803 6.64e-09 ***
## floors        -4.113e+04  3.763e+03 -10.930  < 2e-16 ***
## waterfront     6.174e+05  1.753e+04  35.213  < 2e-16 ***
## view           5.771e+04  2.100e+03  27.477  < 2e-16 ***
## condition      2.806e+04  2.289e+03  12.259  < 2e-16 ***
## grade          5.757e+04  2.154e+03  26.728  < 2e-16 ***
## sqft_above     2.016e+02  3.602e+00  55.968  < 2e-16 ***
## sqft_basement  1.297e+02  4.196e+00  30.916  < 2e-16 ***
## zipcode98002   2.558e+04  1.648e+04   1.552 0.120637    
## zipcode98003  -1.584e+04  1.519e+04  -1.043 0.297145    
## zipcode98004   7.689e+05  1.478e+04  52.015  < 2e-16 ***
## zipcode98005   2.863e+05  1.759e+04  16.281  < 2e-16 ***
## zipcode98006   2.645e+05  1.335e+04  19.808  < 2e-16 ***
## zipcode98007   2.407e+05  1.890e+04  12.734  < 2e-16 ***
## zipcode98008   2.418e+05  1.516e+04  15.953  < 2e-16 ***
## zipcode98010   6.749e+04  2.137e+04   3.158 0.001592 ** 
## zipcode98011   1.192e+05  1.702e+04   7.007 2.53e-12 ***
## zipcode98014   1.182e+05  2.036e+04   5.805 6.56e-09 ***
## zipcode98019   8.934e+04  1.675e+04   5.335 9.68e-08 ***
## zipcode98022  -1.302e+04  1.615e+04  -0.806 0.420106    
## zipcode98023  -3.219e+04  1.307e+04  -2.463 0.013791 *  
## zipcode98024   1.329e+05  2.388e+04   5.566 2.65e-08 ***
## zipcode98027   1.725e+05  1.364e+04  12.646  < 2e-16 ***
## zipcode98028   1.147e+05  1.534e+04   7.481 7.80e-14 ***
## zipcode98029   2.107e+05  1.450e+04  14.534  < 2e-16 ***
## zipcode98030   4.216e+03  1.542e+04   0.274 0.784456    
## zipcode98031   9.564e+03  1.504e+04   0.636 0.524866    
## zipcode98032  -4.205e+03  1.990e+04  -0.211 0.832623    
## zipcode98033   3.573e+05  1.360e+04  26.271  < 2e-16 ***
## zipcode98034   2.038e+05  1.284e+04  15.874  < 2e-16 ***
## zipcode98038   3.003e+04  1.252e+04   2.398 0.016496 *  
## zipcode98039   1.124e+06  2.929e+04  38.383  < 2e-16 ***
## zipcode98040   5.050e+05  1.541e+04  32.766  < 2e-16 ***
## zipcode98042   2.361e+03  1.274e+04   0.185 0.853004    
## zipcode98045   9.539e+04  1.615e+04   5.907 3.55e-09 ***
## zipcode98052   2.266e+05  1.258e+04  18.008  < 2e-16 ***
## zipcode98053   1.871e+05  1.395e+04  13.410  < 2e-16 ***
## zipcode98055   5.117e+04  1.516e+04   3.376 0.000738 ***
## zipcode98056   9.848e+04  1.353e+04   7.278 3.56e-13 ***
## zipcode98058   2.473e+04  1.332e+04   1.856 0.063445 .  
## zipcode98059   7.798e+04  1.319e+04   5.914 3.41e-09 ***
## zipcode98065   8.465e+04  1.458e+04   5.806 6.53e-09 ***
## zipcode98070  -2.165e+04  2.049e+04  -1.057 0.290612    
## zipcode98072   1.457e+05  1.537e+04   9.481  < 2e-16 ***
## zipcode98074   1.734e+05  1.346e+04  12.882  < 2e-16 ***
## zipcode98075   1.634e+05  1.424e+04  11.479  < 2e-16 ***
## zipcode98077   1.115e+05  1.729e+04   6.445 1.19e-10 ***
## zipcode98092  -4.423e+04  1.397e+04  -3.166 0.001547 ** 
## zipcode98102   5.348e+05  2.034e+04  26.293  < 2e-16 ***
## zipcode98103   3.433e+05  1.299e+04  26.428  < 2e-16 ***
## zipcode98105   4.780e+05  1.643e+04  29.099  < 2e-16 ***
## zipcode98106   1.463e+05  1.422e+04  10.285  < 2e-16 ***
## zipcode98107   3.532e+05  1.558e+04  22.666  < 2e-16 ***
## zipcode98108   1.402e+05  1.679e+04   8.345  < 2e-16 ***
## zipcode98109   5.093e+05  2.079e+04  24.496  < 2e-16 ***
## zipcode98112   6.478e+05  1.586e+04  40.834  < 2e-16 ***
## zipcode98115   3.308e+05  1.295e+04  25.536  < 2e-16 ***
## zipcode98116   2.897e+05  1.441e+04  20.098  < 2e-16 ***
## zipcode98117   3.245e+05  1.307e+04  24.833  < 2e-16 ***
## zipcode98118   1.760e+05  1.317e+04  13.367  < 2e-16 ***
## zipcode98119   4.903e+05  1.704e+04  28.772  < 2e-16 ***
## zipcode98122   3.278e+05  1.517e+04  21.612  < 2e-16 ***
## zipcode98125   2.055e+05  1.375e+04  14.947  < 2e-16 ***
## zipcode98126   2.021e+05  1.441e+04  14.021  < 2e-16 ***
## zipcode98133   1.761e+05  1.311e+04  13.434  < 2e-16 ***
## zipcode98136   2.487e+05  1.565e+04  15.898  < 2e-16 ***
## zipcode98144   2.829e+05  1.437e+04  19.687  < 2e-16 ***
## zipcode98146   1.078e+05  1.520e+04   7.095 1.35e-12 ***
## zipcode98148   7.020e+04  2.505e+04   2.802 0.005085 ** 
## zipcode98155   1.499e+05  1.338e+04  11.210  < 2e-16 ***
## zipcode98166   5.327e+04  1.528e+04   3.486 0.000491 ***
## zipcode98168   8.205e+04  1.546e+04   5.308 1.12e-07 ***
## zipcode98177   2.220e+05  1.570e+04  14.142  < 2e-16 ***
## zipcode98178   5.398e+04  1.553e+04   3.476 0.000510 ***
## zipcode98188   3.328e+04  1.947e+04   1.710 0.087358 .  
## zipcode98198   1.894e+03  1.539e+04   0.123 0.902031    
## zipcode98199   3.722e+05  1.470e+04  25.327  < 2e-16 ***
## sqft_living15  1.623e+01  3.448e+00   4.707 2.53e-06 ***
## year           7.148e+04  8.791e+03   8.132 4.57e-16 ***
## month02        3.976e+03  8.136e+03   0.489 0.625040    
## month03        2.933e+04  7.479e+03   3.922 8.84e-05 ***
## month04        3.225e+04  7.295e+03   4.420 9.92e-06 ***
## month05        5.658e+04  9.650e+03   5.863 4.63e-09 ***
## month06        6.766e+04  1.142e+04   5.924 3.22e-09 ***
## month07        6.244e+04  1.142e+04   5.469 4.59e-08 ***
## month08        6.344e+04  1.152e+04   5.506 3.74e-08 ***
## month09        6.026e+04  1.160e+04   5.194 2.09e-07 ***
## month10        6.317e+04  1.156e+04   5.464 4.74e-08 ***
## month11        6.222e+04  1.185e+04   5.252 1.52e-07 ***
## month12        6.291e+04  1.177e+04   5.345 9.17e-08 ***
## renovated      8.133e+04  6.884e+03  11.814  < 2e-16 ***
## age            6.524e+02  7.722e+01   8.448  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 159400 on 15034 degrees of freedom
## Multiple R-squared:  0.8071, Adjusted R-squared:  0.8059 
## F-statistic: 669.1 on 94 and 15034 DF,  p-value: < 2.2e-16
# Display the Regression function of Price
# Assuming you have a function dispRegFunc to display the regression equation
output <- capture.output(dispRegFunc(house_lm))

# Print the output
cat(paste(strwrap(output, width=80), collapse="\n"))
## [1] "Y = -144613592.176341 + -24558.153807bedrooms + 21727.326189bathrooms +
## 0.214092sqft_lot + -41133.703581floors + 617365.104768waterfront +
## 57705.727157view + 28058.254875condition + 57568.423976grade +
## 201.572167sqft_above + 129.72015sqft_basement + 25579.815527zipcode98002 +
## -15838.216892zipcode98003 + 768945.481254zipcode98004 +
## 286299.57958zipcode98005 + 264482.159962zipcode98006 +
## 240710.739984zipcode98007 + 241772.834191zipcode98008 + 67491.46554zipcode98010
## + 119244.110966zipcode98011 + 118191.332813zipcode98014 +
## 89340.433278zipcode98019 + -13017.628691zipcode98022 +
## -32193.950826zipcode98023 + 132929.486913zipcode98024 +
## 172504.171655zipcode98027 + 114746.335898zipcode98028 +
## 210706.693434zipcode98029 + 4216.489404zipcode98030 + 9563.911991zipcode98031 +
## -4205.105932zipcode98032 + 357263.147352zipcode98033 +
## 203811.354184zipcode98034 + 30032.152834zipcode98038 +
## 1124122.428578zipcode98039 + 505014.68266zipcode98040 + 2360.630014zipcode98042
## + 95385.912287zipcode98045 + 226584.45744zipcode98052 +
## 187078.266364zipcode98053 + 51169.581687zipcode98055 + 98480.971348zipcode98056
## + 24725.581137zipcode98058 + 77978.517263zipcode98059 + 84647.71318zipcode98065
## + -21652.681896zipcode98070 + 145726.186299zipcode98072 +
## 173377.843054zipcode98074 + 163413.191304zipcode98075 +
## 111465.143488zipcode98077 + -44226.376599zipcode98092 +
## 534787.459344zipcode98102 + 343260.527571zipcode98103 +
## 477996.624014zipcode98105 + 146276.802596zipcode98106 +
## 353220.475277zipcode98107 + 140156.577506zipcode98108 +
## 509323.195214zipcode98109 + 647756.504821zipcode98112 +
## 330795.899596zipcode98115 + 289697.623578zipcode98116 +
## 324526.72628zipcode98117 + 176017.624876zipcode98118 +
## 490317.122004zipcode98119 + 327783.463174zipcode98122 +
## 205462.988411zipcode98125 + 202069.555692zipcode98126 +
## 176076.600961zipcode98133 + 248718.79151zipcode98136 + 282925.93056zipcode98144
## + 107815.107238zipcode98146 + 70197.252168zipcode98148 +
## 149929.972523zipcode98155 + 53265.197101zipcode98166 + 82051.051169zipcode98168
## + 222005.000985zipcode98177 + 53976.249781zipcode98178 +
## 33280.168179zipcode98188 + 1894.340688zipcode98198 + 372173.573595zipcode98199
## + 16.229056sqft_living15 + 71483.575383year + 3976.456535month02 +
## 29330.05104month03 + 32247.632136month04 + 56582.11854month05 +
## 67657.970585month06 + 62436.21554month07 + 63442.641208month08 +
## 60257.437673month09 + 63168.12952month10 + 62220.912255month11 +
## 62906.163713month12 + 81334.409787renovated + 652.355305age"

Analysis: After removing the predictors deemed insignificant (p-value > 0.05), the model’s R-squared decreased only negligibly, indicating essentially no loss in explanatory power. This small reduction is counterbalanced by the benefits of a more parsimonious model, which can generalize better to unseen data. The remaining predictors continue to display strong significance levels (p < 0.05), confirming their relevance for price prediction. The residual standard error remains relatively stable, further affirming the refined model’s validity.

Predictive Power of the New Model

# Test data Predictions
house_lm_test_pred <- predict(house_lm, newdata = test.dat)

house_lm_test_mse <- mean((house_lm_test_pred - test.dat$price)^2)
house_lm_test_rmse <- sqrt(house_lm_test_mse)
house_lm_test_residuals <- test.dat$price - house_lm_test_pred
house_lm_test_rsq <- 1 - var(house_lm_test_residuals) / var(test.dat$price)
house_lm_test_sse <- sum((test.dat$price - house_lm_test_pred)^2)

# Append the predictions after predictors removal to the results
results.df <- rbind(results.df, data.frame(
  model = "Linear Regression Test without Insignificant Predictors",
  R.Squared.Train = summary(house_lm)$r.square,
  R.Squared.Test = house_lm_test_rsq,
  RMSE.test = house_lm_test_rmse,
  SSE.test = house_lm_test_sse))

print(results.df)
##                                                     model R.Squared.Train
## 1                 Linear Regression Test Data Predictions       0.8071138
## 2 Linear Regression Test without Insignificant Predictors       0.8070843
##   R.Squared.Test RMSE.test    SSE.test
## 1      0.8125711  164316.8 1.75068e+14
## 2      0.8124792  164357.6 1.75155e+14
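The metric computations above can be collected into a small helper so later models are evaluated consistently. This is a convenience sketch, not part of the original pipeline; the function name `eval_preds` is illustrative.

```r
# Hypothetical helper collecting the test-set metrics used above
eval_preds <- function(actual, pred) {
  resid <- actual - pred
  c(RMSE = sqrt(mean(resid^2)),          # root mean squared error
    R2   = 1 - var(resid) / var(actual), # out-of-sample R-squared
    SSE  = sum(resid^2))                 # sum of squared errors
}

# For the real model: eval_preds(test.dat$price, house_lm_test_pred)
eval_preds(c(1, 2, 3, 4), c(1.1, 1.9, 3.2, 3.8))
```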

Diagnostic Plotting for Residual Analysis

This section is dedicated to visually diagnosing the regression model’s fit using residual plots. These plots are essential tools for detecting issues with model assumptions such as linearity, normality, and homoscedasticity.

Boxplot of Residuals

The boxplot provides a visual summary of the residuals’ distribution, offering insights into potential outliers and the central tendency of the model’s errors.

# Create a Data Frame with the residuals
ei <- house_lm$residuals
ei_df <- data.frame(residuals = ei)

# Create a Boxplot of Residuals
ggplot(ei_df, aes(y = residuals)) +
  geom_boxplot(fill = "#6baed6", color = "black") +
  coord_flip() +  # To make the boxplot horizontal
  labs(title = "House Price Regression Residuals", x = "") +
  theme_minimal() +
  theme(
    axis.title.y = element_blank(),
    axis.text.y = element_blank()
  )

Analysis: Several points fall far from the central box and are potential outliers: cases where the model’s predictions were substantially off from the actual house prices. These may reflect unusual or extreme values that the model could not capture, or leverage points that are influencing the fit. The residuals being centered around zero is a good sign, but the outliers suggest instances where the model misses the underlying pattern, possibly because it omits non-linear relationships or interactions between predictors, or because of data issues such as incorrect entries, extreme values, or influential points with a disproportionate effect on the fit. In summary, the model is generally good at predicting house prices, but it performs poorly on a few cases. These should be explored further by examining the data points behind the largest residuals and considering model improvements or data transformations if necessary.
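The follow-up this analysis calls for, pulling out the rows behind the extreme residuals, can be sketched as below. The example runs on the built-in mtcars data so it is self-contained; for the house model, substitute house_lm for `fit` and train.dat for the data frame.

```r
# Identify observations beyond the boxplot whisker rule (1.5 * IQR)
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in for house_lm
res <- residuals(fit)

q     <- quantile(res, c(0.25, 0.75))
fence <- 1.5 * diff(q)
outlier_idx <- which(res < q[1] - fence | res > q[2] + fence)

# Inspect the raw rows driving the largest errors
mtcars[outlier_idx, ]
```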

Residuals vs. Fitted Values Plot

This plot examines the relationship between residuals and predicted values, crucial for identifying non-linear patterns that a linear model may not capture.

# Create Fitted vs Residuals Plot
ggplot(data = data.frame(fitted = house_lm$fitted.values, residuals = ei), aes(x = fitted, y = residuals)) +
  geom_point(color = "#6baed6") +
  geom_hline(yintercept = 0, color = "black", linetype = "dashed") +
  labs(x = "Fitted Values", y = "Residuals", title = "Fitted vs. Residuals Plot") +
  theme_minimal()

Analysis: This plot suggests that while the model may be suitable for a range of predictions, there are certain areas, especially at the higher end of the price spectrum, where the model’s assumptions do not hold, and its predictive accuracy is compromised. Further investigation and possible model refinement are warranted.

Analysis of Variance (ANOVA) for Model Comparison

The ANOVA test has been conducted to decompose the variance explained by each predictor variable in the linear regression model relative to the overall variance of the response variable, house prices. For each term, the F-statistic tests the null hypothesis that the predictor explains no additional variance (i.e., the variable has no effect) against the alternative that it does.

# Run Analysis of Variance on the current model
anova(house_lm)

Analysis: The ANOVA indicates that most of the variables included in the model are significant predictors of house prices. However, the significance of floors is not established, and the high residual sum of squares suggests room for model improvement, possibly by including additional predictors or refining existing ones. Some details:
- Significant predictors: almost all variables except floors have a highly significant p-value (less than 0.05), indicated by “< 2.2e-16”, meaning they contribute significantly to the model and have a strong association with house price.
- Floors variable: floors has a p-value of 0.229, greater than the typical alpha level of 0.05, suggesting this predictor does not have a statistically significant effect on house price within the model.
- Effect sizes: the Sum Sq column represents the variance explained by each variable. For example, bathrooms, with the highest Sum Sq value, contributes a large effect on price.
- Predictor importance: the F value column gives a sense of the relative importance of each predictor, with higher values indicating greater impact. Here, bathrooms and grade have the highest F values, suggesting they are key predictors of house prices in the model.
- Model adequacy: the Residuals row shows the variance not explained by the model. A large residual Sum Sq suggests there may be other factors affecting house prices that are not included in the model.
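The open question about floors can be checked directly with a nested-model F-test. The sketch below uses the built-in mtcars data to stay self-contained, with `qsec` standing in for the insignificant term; for the house data the analogous comparison would be `anova(update(house_lm, . ~ . - floors), house_lm)`.

```r
# Fit the full model and a reduced model with the candidate term dropped
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- update(full, . ~ . - qsec)

# A large p-value means dropping the term loses no significant fit
anova(reduced, full)
```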

Statistical Tests for Individual Coefficients

Individual coefficient significance testing within a regression model is crucial to identify which predictors have a statistically significant impact on the response variable. In this case, a two-tailed t-test is employed for each predictor to test the null hypothesis (\(H_0\)) that the coefficient is equal to zero (no effect) against the alternative hypothesis (\(H_a\)) that the coefficient is not equal to zero (significant effect). The critical t-value is determined based on the alpha level (\(\alpha=0.01\)) and the degrees of freedom associated with the model.

A two-sided t-test is used for each coefficient. The critical t-value for \(\alpha=0.01\) and n − 2 degrees of freedom is 2.576. \(H_0\): the coefficient equals 0; \(H_a\): the coefficient does not equal 0. Decision rule: reject the null hypothesis when |t| > t-critical.

The model, as indicated by the significant coefficients, suggests that features such as the number of bedrooms, bathrooms, square footage of living space, waterfront status, view, grade, etc, play a statistically significant role in predicting the house price.

# We divide by 2 because this is a two tail test
# ... if alpha=0.05, then use .05/2
conf <- 0.01/2

# Degrees of freedom; with n this large, using n - 2 versus the exact
# n - p - 1 makes essentially no difference to the critical value
df <- 21613 - 2
value <- formatC(qt(conf, df, lower.tail = FALSE))    
print(paste("Critical T values: ", value))
## [1] "Critical T values:  2.576"
# Extract coefficients in matrix
matrix_coef <- summary(house_lm)$coefficients
matrix_coef   
##                    Estimate   Std. Error     t value      Pr(>|t|)
## (Intercept)   -1.446136e+08 1.771378e+07  -8.1639047  3.499240e-16
## bedrooms      -2.455815e+04 1.783311e+03 -13.7710981  6.926406e-43
## bathrooms      2.172733e+04 3.087577e+03   7.0370160  2.048900e-12
## sqft_lot       2.140916e-01 3.689242e-02   5.8031334  6.639756e-09
## floors        -4.113370e+04 3.763477e+03 -10.9297077  1.056600e-27
## waterfront     6.173651e+05 1.753232e+04  35.2129755 4.446408e-261
## view           5.770573e+04 2.100160e+03  27.4768230 3.278564e-162
## condition      2.805825e+04 2.288806e+03  12.2589023  2.196766e-34
## grade          5.756842e+04 2.153870e+03  26.7278969 8.541913e-154
## sqft_above     2.015722e+02 3.601539e+00  55.9683440  0.000000e+00
## sqft_basement  1.297202e+02 4.195940e+00  30.9156343 1.624753e-203
## zipcode98002   2.557982e+04 1.647981e+04   1.5521913  1.206375e-01
## zipcode98003  -1.583822e+04 1.519092e+04  -1.0426109  2.971453e-01
## zipcode98004   7.689455e+05 1.478302e+04  52.0154539  0.000000e+00
## zipcode98005   2.862996e+05 1.758541e+04  16.2805166  4.344184e-59
## zipcode98006   2.644822e+05 1.335244e+04  19.8077713  3.199552e-86
## zipcode98007   2.407107e+05 1.890354e+04  12.7336337  5.967012e-37
## zipcode98008   2.417728e+05 1.515512e+04  15.9532069  7.917530e-57
## zipcode98010   6.749147e+04 2.137241e+04   3.1578777  1.592365e-03
## zipcode98011   1.192441e+05 1.701683e+04   7.0074223  2.530664e-12
## zipcode98014   1.181913e+05 2.035940e+04   5.8052456  6.556777e-09
## zipcode98019   8.934043e+04 1.674511e+04   5.3353160  9.676074e-08
## zipcode98022  -1.301763e+04 1.614570e+04  -0.8062598  4.201058e-01
## zipcode98023  -3.219395e+04 1.307124e+04  -2.4629612  1.379061e-02
## zipcode98024   1.329295e+05 2.388251e+04   5.5659754  2.651475e-08
## zipcode98027   1.725042e+05 1.364133e+04  12.6457042  1.811952e-36
## zipcode98028   1.147463e+05 1.533883e+04   7.4807745  7.796915e-14
## zipcode98029   2.107067e+05 1.449729e+04  14.5342110  1.544974e-47
## zipcode98030   4.216489e+03 1.541559e+04   0.2735211  7.844564e-01
## zipcode98031   9.563912e+03 1.504051e+04   0.6358770  5.248662e-01
## zipcode98032  -4.205106e+03 1.989719e+04  -0.2113417  8.326235e-01
## zipcode98033   3.572631e+05 1.359899e+04  26.2712916 9.100559e-149
## zipcode98034   2.038114e+05 1.283906e+04  15.8743158  2.734606e-56
## zipcode98038   3.003215e+04 1.252367e+04   2.3980311  1.649558e-02
## zipcode98039   1.124122e+06 2.928702e+04  38.3829572 1.399421e-307
## zipcode98040   5.050147e+05 1.541260e+04  32.7663564 1.632046e-227
## zipcode98042   2.360630e+03 1.274017e+04   0.1852903  8.530038e-01
## zipcode98045   9.538591e+04 1.614728e+04   5.9072443  3.553896e-09
## zipcode98052   2.265845e+05 1.258278e+04  18.0075104  9.638243e-72
## zipcode98053   1.870783e+05 1.395032e+04  13.4103198  9.023470e-41
## zipcode98055   5.116958e+04 1.515835e+04   3.3756702  7.382247e-04
## zipcode98056   9.848097e+04 1.353154e+04   7.2778812  3.558802e-13
## zipcode98058   2.472558e+04 1.332055e+04   1.8561977  6.344493e-02
## zipcode98059   7.797852e+04 1.318519e+04   5.9140987  3.409363e-09
## zipcode98065   8.464771e+04 1.457970e+04   5.8058614  6.532775e-09
## zipcode98070  -2.165268e+04 2.048860e+04  -1.0568162  2.906125e-01
## zipcode98072   1.457262e+05 1.537029e+04   9.4810298  2.886966e-21
## zipcode98074   1.733778e+05 1.345891e+04  12.8820085  9.003940e-38
## zipcode98075   1.634132e+05 1.423560e+04  11.4791961  2.246152e-30
## zipcode98077   1.114651e+05 1.729453e+04   6.4451094  1.190359e-10
## zipcode98092  -4.422638e+04 1.396760e+04  -3.1663538  1.546712e-03
## zipcode98102   5.347875e+05 2.033989e+04  26.2925504 5.330192e-149
## zipcode98103   3.432605e+05 1.298871e+04  26.4275984 1.765582e-150
## zipcode98105   4.779966e+05 1.642659e+04  29.0989546 3.729089e-181
## zipcode98106   1.462768e+05 1.422247e+04  10.2849058  9.953885e-25
## zipcode98107   3.532205e+05 1.558383e+04  22.6658309 7.236634e-112
## zipcode98108   1.401566e+05 1.679476e+04   8.3452553  7.718683e-17
## zipcode98109   5.093232e+05 2.079209e+04  24.4960084 5.670383e-130
## zipcode98112   6.477565e+05 1.586311e+04  40.8341481  0.000000e+00
## zipcode98115   3.307959e+05 1.295432e+04  25.5355653 7.831948e-141
## zipcode98116   2.896976e+05 1.441392e+04  20.0984647  1.109008e-88
## zipcode98117   3.245267e+05 1.306834e+04  24.8330405 1.897406e-133
## zipcode98118   1.760176e+05 1.316840e+04  13.3666722  1.612718e-40
## zipcode98119   4.903171e+05 1.704172e+04  28.7715723 2.973298e-177
## zipcode98122   3.277835e+05 1.516656e+04  21.6122450 4.892767e-102
## zipcode98125   2.054630e+05 1.374591e+04  14.9472062  3.722160e-50
## zipcode98126   2.020696e+05 1.441188e+04  14.0210366  2.206120e-44
## zipcode98133   1.760766e+05 1.310704e+04  13.4337426  6.602613e-41
## zipcode98136   2.487188e+05 1.564508e+04  15.8975773  1.898624e-56
## zipcode98144   2.829259e+05 1.437106e+04  19.6871963  3.277888e-85
## zipcode98146   1.078151e+05 1.519594e+04   7.0949946  1.351350e-12
## zipcode98148   7.019725e+04 2.505260e+04   2.8019952  5.085258e-03
## zipcode98155   1.499300e+05 1.337509e+04  11.2096456  4.767267e-29
## zipcode98166   5.326520e+04 1.527909e+04   3.4861509  4.914225e-04
## zipcode98168   8.205105e+04 1.545726e+04   5.3082543  1.122529e-07
## zipcode98177   2.220050e+05 1.569874e+04  14.1415798  4.096550e-45
## zipcode98178   5.397625e+04 1.552788e+04   3.4760864  5.102244e-04
## zipcode98188   3.328017e+04 1.946646e+04   1.7096159  8.735755e-02
## zipcode98198   1.894341e+03 1.538887e+04   0.1230981  9.020310e-01
## zipcode98199   3.721736e+05 1.469502e+04  25.3265174 1.290913e-138
## sqft_living15  1.622906e+01 3.447751e+00   4.7071420  2.534547e-06
## year           7.148358e+04 8.790841e+03   8.1315971  4.565187e-16
## month02        3.976457e+03 8.136311e+03   0.4887296  6.250402e-01
## month03        2.933005e+04 7.479100e+03   3.9216019  8.835285e-05
## month04        3.224763e+04 7.295148e+03   4.4204220  9.919850e-06
## month05        5.658212e+04 9.650053e+03   5.8633996  4.629948e-09
## month06        6.765797e+04 1.142147e+04   5.9237538  3.215443e-09
## month07        6.243622e+04 1.141608e+04   5.4691474  4.594249e-08
## month08        6.344264e+04 1.152305e+04   5.5057172  3.736930e-08
## month09        6.025744e+04 1.160210e+04   5.1936676  2.088826e-07
## month10        6.316813e+04 1.156147e+04   5.4636784  4.737848e-08
## month11        6.222091e+04 1.184694e+04   5.2520641  1.524546e-07
## month12        6.290616e+04 1.176918e+04   5.3449934  9.174021e-08
## renovated      8.133441e+04 6.884406e+03  11.8142962  4.564736e-32
## age            6.523553e+02 7.721745e+01   8.4482885  3.224081e-17
# Matrix manipulation to extract estimates
my_estimates <- matrix_coef[ , 1]

# Where Pr(>|t|) < 0.01 there is sufficient statistical evidence to
# reject the null hypothesis for that coefficient; equivalently,
# |t value| exceeds the critical value of 2.576 computed above

Analysis: The coefficient table indicates that many predictors have t-values well above the critical value of 2.576, meaning their coefficients are significantly different from zero. For instance, bedrooms, bathrooms, and waterfront status exhibit large t-values and p-values far below the 0.01 significance level, confirming their strong influence on house prices.

\(H_0\): error variances are constant. \(H_a\): error variances are not constant. Decision rule: reject \(H_0\) if the test statistic exceeds its critical value, or if the p-value < \(\alpha\) (0.01).

The p-value is < 2.2e-16, so we reject \(H_0\): the error variances are not constant.

The intercept, despite its negative coefficient, is significantly different from zero. Since a house with all predictors at zero lies far outside the observed data, the intercept serves as a fitting constant for the regression surface rather than a realistic base price.

The analysis also highlights the complexity and the nuanced nature of real estate pricing, where a multitude of factors must be considered to accurately predict prices. The significant predictors identified through these t-tests can provide actionable insights for real estate valuation and investment strategies.

In summary, most predictors in the model contribute significantly to predicting house prices, with a few exceptions. These insights can guide future model refinement and the selection of variables for more robust predictive analytics.
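The decision rule above can be automated by screening the coefficient matrix for |t| below the critical value. A sketch on the built-in mtcars data so it runs standalone; substitute house_lm to screen the house model.

```r
fit    <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # stand-in for house_lm
coefs  <- summary(fit)$coefficients
t_crit <- qt(0.01 / 2, df = df.residual(fit), lower.tail = FALSE)

# Names of terms that fail the two-tailed test at alpha = 0.01
insig <- rownames(coefs)[abs(coefs[, "t value"]) < t_crit]
insig
```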

Breusch-Pagan Test for Heteroskedasticity

The Breusch-Pagan test is a statistical test used to detect the presence of heteroskedasticity in a regression model. Heteroskedasticity occurs when the variance of the errors is not constant across all levels of the independent variables. When heteroskedasticity is present, it can lead to inefficiency of the estimates and can affect the validity of some statistical tests.

# Test the model for Heteroskedasticity
bptest(house_lm, studentize = FALSE)
## 
##  Breusch-Pagan test
## 
## data:  house_lm
## BP = 67014, df = 94, p-value < 2.2e-16

Analysis: The Breusch-Pagan test yields a BP statistic of 67014 on 94 degrees of freedom, with a p-value below 2.2e-16. This extremely low p-value provides strong evidence against the null hypothesis of homoskedasticity (constant error variance), so we reject it in favor of the alternative: heteroskedasticity is present in the model. This means the error variance changes with the level of the independent variables, and the model could likely be improved, for instance by transforming variables, adding variables that capture the effect on the variance, or using heteroskedasticity-consistent standard error estimators. In practical terms, the model’s predictive performance may vary across different values of the predictors, and care should be taken when interpreting the significance of the predictors and the prediction intervals for house prices. Further investigation and potential model adjustments are warranted.
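One standard remedy is to keep the coefficient estimates but replace the usual standard errors with heteroskedasticity-consistent ("sandwich") ones, e.g. via `sandwich::vcovHC` with `lmtest::coeftest`. The simplest (HC0) form can also be computed in base R, as sketched below on the built-in mtcars data; substitute house_lm for the real model.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in for house_lm
X <- model.matrix(fit)
u <- residuals(fit)

# HC0 sandwich estimator: (X'X)^-1 [X' diag(u^2) X] (X'X)^-1
bread <- solve(crossprod(X))
meat  <- crossprod(X * u)       # equals X' diag(u^2) X
vc    <- bread %*% meat %*% bread
robust_se <- sqrt(diag(vc))

# Compare the classical and robust standard errors side by side
cbind(classical = sqrt(diag(vcov(fit))), robust = robust_se)
```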

Box-Cox Transformation

The Box-Cox transformation (“Box-Cox Transformations,” n.d.) is a statistical technique used to stabilize variance and make the data more closely adhere to the assumption of normality, which is a key consideration in linear regression modeling. This section of the analysis focuses on determining the optimal lambda parameter for the Box-Cox transformation of the response variable, which in this context is the house price. By conducting an initial wide search followed by a refined search for lambda, we aim to identify the value that maximizes the log-likelihood of our model, thereby indicating the most appropriate transformation for our dataset.

# Initial wide search for a Lambda value
bc_initial <- boxcox(house_lm, lambda=seq(-4, 4, by=0.1))

lambda_initial <- bc_initial$x[which.max(bc_initial$y)]
cat("The initial optimal lambda is:", lambda_initial, "\n")
## The initial optimal lambda is: 0.1212121

Analysis: The initial search for the optimal lambda within the range of −4 to 4 suggested a lambda of approximately 0.1212, indicating that a transformation is indeed warranted. A narrower search will be conducted next to refine this value.

# Narrow down search based on initial results
lower_bound <- max(-4, lambda_initial - 0.5)
upper_bound <- min(4, lambda_initial + 0.5)

# More precise search around the initial estimate
bc_refined <- boxcox(house_lm, lambda=seq(lower_bound, upper_bound, by=0.01))

lambda_refined <- bc_refined$x[which.max(bc_refined$y)]
cat("The refined optimal lambda is:", lambda_refined, "\n")
## The refined optimal lambda is: 0.1012121
# Apply refined value to the variable to be used on the transformation
lambda <- lambda_refined

Analysis: The proximity of the optimal lambda to zero hints at the potential efficacy of a logarithmic transformation to improve model assumptions. However, the best lambda identified through the Box-Cox method is not exactly zero, suggesting that a Box-Cox transformation with this specific lambda may offer a better model fit than a straightforward logarithmic transformation. Applying this transformation to the house price is expected to yield residuals with more constant variance and a distribution that more closely approximates normality. These improvements are crucial for enhancing the reliability of the model’s predictions and the validity of its inferential statistics. The refined transformation with a lambda of 0.1012 should now be applied to the house price data, and the resulting model should be evaluated against the original model.
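For reference, the canonical Box-Cox transform is (y^λ − 1)/λ with a log fallback at λ = 0. The plain power transform price^lambda used to refit the model differs from it only by an affine change of the response, so the resulting fit is equivalent. A minimal sketch:

```r
# Canonical Box-Cox transform with the log limit at lambda = 0
boxcox_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

lambda <- 0.1012121                 # refined value found above
prices <- c(100000, 250000, 500000)
boxcox_transform(prices, lambda)    # monotone, variance-stabilizing
```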

Final Model Evaluation

After applying the Box-Cox transformation to the dataset, a final model evaluation is essential to determine the impact of the transformation on the model’s performance. This evaluation will assess how the transformation has potentially improved the model’s adherence to the assumptions of linear regression, including linearity, normality of errors, homoscedasticity, and the absence of influential outliers. These characteristics are critical for the validity of the model’s statistical inferences and its predictive capabilities.

Summary of Transformed Model

This subsection provides a summary of the linear regression model fitted to the transformed response variable. The summary includes key metrics such as the coefficients, their standard errors, t-values, and significance levels. These metrics offer insight into the relationship between the predictors and the transformed house prices, highlighting which variables have a significant effect on the outcome. This summary is pivotal in understanding the dynamics of the real estate market as captured by the model.

# Refit the model on the Box-Cox transformed response
house_lm1 <- lm(price^lambda ~ ., data = train.dat)
summary(house_lm1)
## 
## Call:
## lm(formula = price^lambda ~ ., data = train.dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49752 -0.03671  0.00276  0.03842  0.37428 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.984e+01  7.633e+00 -10.460  < 2e-16 ***
## bedrooms      -2.091e-04  7.684e-04  -0.272 0.785568    
## bathrooms      1.379e-02  1.330e-03  10.362  < 2e-16 ***
## sqft_lot       2.697e-07  1.590e-08  16.964  < 2e-16 ***
## floors        -1.337e-02  1.622e-03  -8.247  < 2e-16 ***
## waterfront     1.857e-01  7.555e-03  24.578  < 2e-16 ***
## view           2.401e-02  9.050e-04  26.531  < 2e-16 ***
## condition      2.333e-02  9.862e-04  23.654  < 2e-16 ***
## grade          3.444e-02  9.281e-04  37.107  < 2e-16 ***
## sqft_above     8.165e-05  1.552e-06  52.614  < 2e-16 ***
## sqft_basement  5.142e-05  1.808e-06  28.441  < 2e-16 ***
## zipcode98002  -1.045e-02  7.101e-03  -1.472 0.140988    
## zipcode98003   6.795e-03  6.546e-03   1.038 0.299259    
## zipcode98004   4.189e-01  6.370e-03  65.757  < 2e-16 ***
## zipcode98005   2.648e-01  7.578e-03  34.943  < 2e-16 ***
## zipcode98006   2.261e-01  5.754e-03  39.299  < 2e-16 ***
## zipcode98007   2.367e-01  8.146e-03  29.063  < 2e-16 ***
## zipcode98008   2.375e-01  6.530e-03  36.374  < 2e-16 ***
## zipcode98010   8.922e-02  9.209e-03   9.688  < 2e-16 ***
## zipcode98011   1.615e-01  7.333e-03  22.028  < 2e-16 ***
## zipcode98014   1.206e-01  8.773e-03  13.751  < 2e-16 ***
## zipcode98019   1.203e-01  7.215e-03  16.671  < 2e-16 ***
## zipcode98022   9.192e-03  6.957e-03   1.321 0.186456    
## zipcode98023  -1.211e-02  5.632e-03  -2.151 0.031519 *  
## zipcode98024   1.430e-01  1.029e-02  13.900  < 2e-16 ***
## zipcode98027   1.905e-01  5.878e-03  32.412  < 2e-16 ***
## zipcode98028   1.518e-01  6.609e-03  22.963  < 2e-16 ***
## zipcode98029   2.201e-01  6.247e-03  35.238  < 2e-16 ***
## zipcode98030   1.498e-02  6.643e-03   2.255 0.024123 *  
## zipcode98031   2.343e-02  6.481e-03   3.615 0.000302 ***
## zipcode98032  -7.946e-03  8.574e-03  -0.927 0.354036    
## zipcode98033   2.877e-01  5.860e-03  49.100  < 2e-16 ***
## zipcode98034   1.993e-01  5.532e-03  36.021  < 2e-16 ***
## zipcode98038   5.937e-02  5.396e-03  11.002  < 2e-16 ***
## zipcode98039   4.914e-01  1.262e-02  38.938  < 2e-16 ***
## zipcode98040   3.181e-01  6.641e-03  47.893  < 2e-16 ***
## zipcode98042   1.735e-02  5.490e-03   3.160 0.001583 ** 
## zipcode98045   1.200e-01  6.958e-03  17.253  < 2e-16 ***
## zipcode98052   2.337e-01  5.422e-03  43.101  < 2e-16 ***
## zipcode98053   2.075e-01  6.011e-03  34.524  < 2e-16 ***
## zipcode98055   4.813e-02  6.532e-03   7.368 1.82e-13 ***
## zipcode98056   1.120e-01  5.831e-03  19.214  < 2e-16 ***
## zipcode98058   5.226e-02  5.740e-03   9.105  < 2e-16 ***
## zipcode98059   1.155e-01  5.681e-03  20.321  < 2e-16 ***
## zipcode98065   1.380e-01  6.282e-03  21.963  < 2e-16 ***
## zipcode98070   1.076e-01  8.829e-03  12.184  < 2e-16 ***
## zipcode98072   1.775e-01  6.623e-03  26.806  < 2e-16 ***
## zipcode98074   2.022e-01  5.799e-03  34.862  < 2e-16 ***
## zipcode98075   1.945e-01  6.134e-03  31.711  < 2e-16 ***
## zipcode98077   1.558e-01  7.452e-03  20.902  < 2e-16 ***
## zipcode98092   1.133e-03  6.019e-03   0.188 0.850713    
## zipcode98102   3.578e-01  8.764e-03  40.820  < 2e-16 ***
## zipcode98103   2.994e-01  5.597e-03  53.487  < 2e-16 ***
## zipcode98105   3.532e-01  7.078e-03  49.902  < 2e-16 ***
## zipcode98106   1.114e-01  6.128e-03  18.170  < 2e-16 ***
## zipcode98107   3.053e-01  6.715e-03  45.471  < 2e-16 ***
## zipcode98108   1.359e-01  7.237e-03  18.773  < 2e-16 ***
## zipcode98109   3.695e-01  8.959e-03  41.243  < 2e-16 ***
## zipcode98112   3.993e-01  6.835e-03  58.422  < 2e-16 ***
## zipcode98115   3.003e-01  5.582e-03  53.805  < 2e-16 ***
## zipcode98116   2.739e-01  6.211e-03  44.092  < 2e-16 ***
## zipcode98117   2.967e-01  5.631e-03  52.693  < 2e-16 ***
## zipcode98118   1.627e-01  5.674e-03  28.671  < 2e-16 ***
## zipcode98119   3.695e-01  7.343e-03  50.313  < 2e-16 ***
## zipcode98122   2.916e-01  6.535e-03  44.614  < 2e-16 ***
## zipcode98125   2.085e-01  5.923e-03  35.206  < 2e-16 ***
## zipcode98126   1.942e-01  6.210e-03  31.267  < 2e-16 ***
## zipcode98133   1.659e-01  5.648e-03  29.378  < 2e-16 ***
## zipcode98136   2.463e-01  6.741e-03  36.539  < 2e-16 ***
## zipcode98144   2.398e-01  6.192e-03  38.716  < 2e-16 ***
## zipcode98146   9.790e-02  6.548e-03  14.951  < 2e-16 ***
## zipcode98148   5.225e-02  1.080e-02   4.840 1.31e-06 ***
## zipcode98155   1.520e-01  5.763e-03  26.379  < 2e-16 ***
## zipcode98166   1.040e-01  6.584e-03  15.794  < 2e-16 ***
## zipcode98168   3.274e-02  6.661e-03   4.916 8.92e-07 ***
## zipcode98177   2.192e-01  6.765e-03  32.408  < 2e-16 ***
## zipcode98178   5.086e-02  6.691e-03   7.601 3.10e-14 ***
## zipcode98188   2.898e-02  8.388e-03   3.455 0.000551 ***
## zipcode98198   2.377e-02  6.631e-03   3.585 0.000338 ***
## zipcode98199   3.133e-01  6.332e-03  49.481  < 2e-16 ***
## sqft_living15  3.224e-05  1.486e-06  21.699  < 2e-16 ***
## year           4.110e-02  3.788e-03  10.851  < 2e-16 ***
## month02        5.539e-03  3.506e-03   1.580 0.114179    
## month03        1.840e-02  3.223e-03   5.710 1.15e-08 ***
## month04        2.504e-02  3.143e-03   7.965 1.78e-15 ***
## month05        3.277e-02  4.158e-03   7.882 3.45e-15 ***
## month06        4.042e-02  4.922e-03   8.214 2.32e-16 ***
## month07        3.761e-02  4.919e-03   7.646 2.19e-14 ***
## month08        3.867e-02  4.965e-03   7.788 7.26e-15 ***
## month09        3.934e-02  4.999e-03   7.868 3.85e-15 ***
## month10        3.561e-02  4.982e-03   7.147 9.27e-13 ***
## month11        3.353e-02  5.105e-03   6.568 5.26e-11 ***
## month12        3.767e-02  5.071e-03   7.429 1.15e-13 ***
## renovated      4.067e-02  2.966e-03  13.711  < 2e-16 ***
## age            1.217e-04  3.327e-05   3.657 0.000256 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06866 on 15034 degrees of freedom
## Multiple R-squared:  0.8847, Adjusted R-squared:  0.8839 
## F-statistic:  1227 on 94 and 15034 DF,  p-value: < 2.2e-16

Analysis: The residuals of the transformed model cluster tightly around zero on the transformed scale, suggesting an improved fit. The coefficients for bathrooms, sqft_lot, waterfront, view, condition, grade, sqft_above, and sqft_basement remain significant, indicating a consistent impact on the transformed house prices. The substantial F-statistic and near-zero p-value confirm the model’s overall statistical significance. However, the bedrooms variable has become insignificant, implying that its effect on the transformed price is minimal. The multiple R-squared of 0.8847 is quite high, signifying that the transformed model explains a large portion of the variance in the data. Overall, the Box-Cox transformation appears to have enhanced the model’s explanatory power.
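Because house_lm1 predicts on the price^lambda scale, its outputs must be raised to the 1/lambda power before being reported in dollars. A minimal sketch of the round trip; with the real model this would be `predict(house_lm1, newdata = test.dat)^(1 / lambda)`.

```r
lambda <- 0.1012121
price  <- 450000

transformed <- price^lambda        # the scale the model is fitted on
back <- transformed^(1 / lambda)   # invert to recover the dollar scale

all.equal(back, price)
```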

Diagnostic Plots of Transformed Model

Diagnostic plots are instrumental in visually assessing the model’s compliance with regression assumptions. This subsection will exhibit a series of plots, including Residuals vs. Fitted, Normal Q-Q, Scale-Location, and Residuals vs. Leverage. Each plot serves a distinct purpose, from detecting non-linearity and outliers to checking for equal variance and the influence of individual data points. A close examination of these plots will reveal any remaining anomalies post-transformation and guide any further refinement needed for the model.

# Plot the BoxCox Transformed Model
par(mfrow=c(2,2))
plot(house_lm1)

Analysis: The diagnostic plots suggest that while the Box-Cox transformation has likely improved the model fit, there are still some areas where the model does not perfectly meet the assumptions of linear regression. There may be potential outliers or influential points that need to be investigated further. Additionally, the presence of heteroscedasticity might be addressed with additional transformations or by considering a different type of model, such as generalized least squares. It would also be prudent to examine the leverage points more closely to determine if any action, such as exclusion or weighting, is necessary.

After the Box-Cox transformation, we observe that:
- The residual standard error has dropped to 0.06866; note that it is now on the transformed scale, so it is not directly comparable to the original model’s error in dollars.
- Multiple R-squared has increased to 88.5% from about 80.7%.
- The p-value remains the same at < 2.2e-16.

Overall, the model appears to be a much better fit than before. From the diagnostic plots we observe the following:
- Linearity assumption: the Residuals vs Fitted plot looks better than before and supports the appropriateness of a linear regression model.
- Normality assumption: the points align much more closely with the diagonal in the Q-Q Residuals plot, although some heavy tails remain at both ends.
- Constant variance assumption: the Scale-Location plot points to roughly constant variance.
- There are still some outlier observations that may be influential, as seen in the Residuals vs Leverage plot.
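The influential points flagged in the Residuals vs Leverage panel can be quantified with Cook's distance. A self-contained sketch on mtcars; substitute house_lm1 and train.dat for the transformed house model.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in for house_lm1
d <- cooks.distance(fit)

# A common screening threshold for influence is 4 / n
influential <- which(d > 4 / length(d))
mtcars[influential, ]
```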

Display of Transformed Regression Equation

The transformation’s efficacy is encapsulated in the regression equation of the transformed model. This subsection will present the regression function, explicitly showing the transformed relationship between the predictors and the outcome. By interpreting this function, we can understand how the Box-Cox transformation affects the scale and interactions of the variables within the model, thus providing a comprehensive view of the model’s structure after the transformation.

# Display the Regression function of Price^Lambda
output <- capture.output(dispRegFunc(house_lm1))

# Print the output of dispRegFunc wrapped in max 80 char width
cat(paste(strwrap(output, width=80), collapse="\n"))
## [1] "Y = -79.841772 + -0.000209bedrooms + 0.013786bathrooms + 0sqft_lot +
## -0.013374floors + 0.185676waterfront + 0.024009view + 0.023329condition +
## 0.034439grade + 8.2e-05sqft_above + 5.1e-05sqft_basement +
## -0.010454zipcode98002 + 0.006795zipcode98003 + 0.418872zipcode98004 +
## 0.26478zipcode98005 + 0.226109zipcode98006 + 0.23673zipcode98007 +
## 0.237536zipcode98008 + 0.089219zipcode98010 + 0.161518zipcode98011 +
## 0.120633zipcode98014 + 0.12029zipcode98019 + 0.009192zipcode98022 +
## -0.012113zipcode98023 + 0.143048zipcode98024 + 0.190517zipcode98027 +
## 0.151772zipcode98028 + 0.220127zipcode98029 + 0.014982zipcode98030 +
## 0.023426zipcode98031 + -0.007946zipcode98032 + 0.287716zipcode98033 +
## 0.199279zipcode98034 + 0.059374zipcode98038 + 0.491384zipcode98039 +
## 0.318069zipcode98040 + 0.017346zipcode98042 + 0.120045zipcode98045 +
## 0.233691zipcode98052 + 0.207533zipcode98053 + 0.048128zipcode98055 +
## 0.112033zipcode98056 + 0.052259zipcode98058 + 0.115452zipcode98059 +
## 0.137977zipcode98065 + 0.10757zipcode98070 + 0.177535zipcode98072 +
## 0.202178zipcode98074 + 0.194519zipcode98075 + 0.155767zipcode98077 +
## 0.001133zipcode98092 + 0.357768zipcode98102 + 0.299359zipcode98103 +
## 0.353214zipcode98105 + 0.111356zipcode98106 + 0.305338zipcode98107 +
## 0.135858zipcode98108 + 0.369512zipcode98109 + 0.399335zipcode98112 +
## 0.300342zipcode98115 + 0.273855zipcode98116 + 0.296719zipcode98117 +
## 0.162685zipcode98118 + 0.369459zipcode98119 + 0.291563zipcode98122 +
## 0.208532zipcode98125 + 0.194167zipcode98126 + 0.165921zipcode98133 +
## 0.246328zipcode98136 + 0.23975zipcode98144 + 0.097897zipcode98146 +
## 0.052252zipcode98148 + 0.15203zipcode98155 + 0.103981zipcode98166 +
## 0.032744zipcode98168 + 0.219225zipcode98177 + 0.050861zipcode98178 +
## 0.028984zipcode98188 + 0.023774zipcode98198 + 0.31332zipcode98199 +
## 3.2e-05sqft_living15 + 0.041103year + 0.005539month02 + 0.018401month03 +
## 0.025036month04 + 0.032773month05 + 0.040424month06 + 0.037613month07 +
## 0.038669month08 + 0.039335month09 + 0.035606month10 + 0.03353month11 +
## 0.037675month12 + 0.040675renovated + 0.000122age"

Predictions on Test Data and Model Comparison

This section is designed to evaluate how well the Box-Cox transformed model performs on unseen data and to compare this performance with the original model.

# Note: exp() inverts a log transform, not the power transform used here
# (price^lambda), so these caret metrics overstate the error; the
# correct inverse, pred^(1/lambda), is applied in the next chunk
PredictedTest<-exp(predict(house_lm1,test.dat))
ModelTest<-data.frame(obs = test.dat$price, pred=PredictedTest)
round(defaultSummary(ModelTest),3)
##       RMSE   Rsquared        MAE 
## 662683.979      0.844 543252.035
# Test BoxCox model predictions
pred <- predict(house_lm1,test.dat)^(1/lambda)
act <- test.dat$price

house_lm1_test_mse <- mean((pred - act)^2)
house_lm1_test_rmse <- sqrt(house_lm1_test_mse)
house_lm1_test_residuals <- act - pred
house_lm1_test_rsq <- 1 - var(house_lm1_test_residuals) / var(act)
house_lm1_test_sse <- sum((act - pred)^2)

# Append results after BoxCox Transformation
results.df = rbind(results.df, 
                   data.frame(model = "Linear Regression Test after Boxcox",
                              R.Squared.Train = summary(house_lm1)$r.square,
                              R.Squared.Test = house_lm1_test_rsq,
                              RMSE.test = house_lm1_test_rmse,
                              SSE.test = house_lm1_test_sse))

print(results.df)
##                                                     model R.Squared.Train
## 1                 Linear Regression Test Data Predictions       0.8071138
## 2 Linear Regression Test without Insignificant Predictors       0.8070843
## 3                     Linear Regression Test after Boxcox       0.8846619
##   R.Squared.Test RMSE.test     SSE.test
## 1      0.8125711  164316.8 1.750680e+14
## 2      0.8124792  164357.6 1.751550e+14
## 3      0.8286341  157271.9 1.603782e+14

Analysis: The output indicates an improvement in the model’s performance on the test data after applying the Box-Cox transformation. The test R-squared increased from 0.8125 to 0.8286, the RMSE decreased from 164,358 to 157,272, and the SSE fell from 1.75e+14 to 1.60e+14, all pointing to more accurate predictions from the transformed model. Overall, the Box-Cox transformation has produced a model that explains a greater proportion of the variance in house prices and predicts them with higher accuracy, as evidenced by the improved R-squared and lower RMSE and SSE values. This demonstrates the effectiveness of the transformation in stabilizing variance and improving model fit for the data at hand.
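Because the response was modeled as price^lambda, predictions must be returned to dollars with the inverse power transform rather than exp(), which inverts a log transform. A minimal round-trip sketch with a hypothetical lambda value:

```r
lambda <- 0.2                            # hypothetical; the report derives
                                         # its lambda from the Box-Cox profile
price  <- c(250000, 450000, 900000)

transformed <- price^lambda              # scale the model is fitted on
recovered   <- transformed^(1 / lambda)  # inverse applied to predictions

all.equal(recovered, price)              # TRUE up to floating-point error
```

The same inverse is what `predict(house_lm1, test.dat)^(1/lambda)` applies in the chunk above.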

IV. Model Performance Testing

In this section we use the test data set to assess model performance. We build the best multiple linear model using stepwise (both-directions) selection and compare the performance of the two best linear models. Model assumptions are checked for the final linear model, and remedial measures (transformations, etc.) are applied to help satisfy them. In particular, we investigate unequal variances and multicollinearity in depth and, where necessary, apply remedial methods (WLS, Ridge, Elastic Net, Lasso).

Analysis:

The stepwise method yields a similar model, with the elimination of bedrooms being the only difference; this variable contributed essentially nothing to the fit of the Box-Cox transformed model. The R-squared value remains nearly the same at 0.885, with nearly all predictor variables significant at the 0.001 level (a handful of zipcode and month dummies excepted).

Graph 1 (Residuals vs Fitted): (linearity assumption) A linear relationship seems appropriate. (The average mean error equal to zero assumption) The average mean seems to be equal to zero.

Graph 2 (Q-Q Residuals): (normality assumption) There is minor departure from a normal distribution at the tails.

Graph 3 (Scale-Location): (constant variance assumption) The residuals do not seem to be equally distributed throughout. (homoscedasticity assumption) The variances do not appear constant, which is confirmed by the Breusch-Pagan test.

Graph 4 (Residuals vs Leverage): There are no influential outliers.

# Fit stepwise model (both directions; k = log(n) applies a BIC-style penalty)
stepwise_model <- step(house_lm1, direction = "both", k = log(nrow(train.dat)))
## Start:  AIC=-80227.56
## price^lambda ~ bedrooms + bathrooms + sqft_lot + floors + waterfront + 
##     view + condition + grade + sqft_above + sqft_basement + zipcode + 
##     sqft_living15 + year + month + renovated + age
## 
##                 Df Sum of Sq     RSS    AIC
## - bedrooms       1     0.000  70.884 -80237
## <none>                        70.883 -80228
## - age            1     0.063  70.946 -80224
## - month         11     0.593  71.476 -80207
## - floors         1     0.321  71.204 -80169
## - bathrooms      1     0.506  71.390 -80130
## - year           1     0.555  71.438 -80119
## - renovated      1     0.886  71.770 -80049
## - sqft_lot       1     1.357  72.240 -79950
## - sqft_living15  1     2.220  73.103 -79771
## - condition      1     2.638  73.521 -79684
## - waterfront     1     2.848  73.731 -79641
## - view           1     3.319  74.202 -79545
## - sqft_basement  1     3.814  74.697 -79444
## - grade          1     6.492  77.375 -78911
## - sqft_above     1    13.052  83.935 -77680
## - zipcode       69   135.673 206.556 -64711
## 
## Step:  AIC=-80237.11
## price^lambda ~ bathrooms + sqft_lot + floors + waterfront + view + 
##     condition + grade + sqft_above + sqft_basement + zipcode + 
##     sqft_living15 + year + month + renovated + age
## 
##                 Df Sum of Sq     RSS    AIC
## <none>                        70.884 -80237
## - age            1     0.063  70.946 -80233
## + bedrooms       1     0.000  70.883 -80228
## - month         11     0.593  71.476 -80217
## - floors         1     0.321  71.204 -80178
## - bathrooms      1     0.522  71.405 -80136
## - year           1     0.556  71.439 -80129
## - renovated      1     0.886  71.770 -80059
## - sqft_lot       1     1.366  72.250 -79958
## - sqft_living15  1     2.222  73.106 -79780
## - condition      1     2.638  73.522 -79694
## - waterfront     1     2.856  73.739 -79649
## - view           1     3.339  74.222 -79550
## - sqft_basement  1     4.081  74.965 -79400
## - grade          1     6.587  77.471 -78902
## - sqft_above     1    14.183  85.067 -77487
## - zipcode       69   136.459 207.342 -64663
sw_model <- lm(formula = stepwise_model$model, data = train.dat)

# Summarize the Stepwise model
summary(sw_model)
## 
## Call:
## lm(formula = stepwise_model$model, data = train.dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49741 -0.03668  0.00280  0.03845  0.37447 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.987e+01  7.632e+00 -10.466  < 2e-16 ***
## bathrooms      1.371e-02  1.304e-03  10.519  < 2e-16 ***
## sqft_lot       2.700e-07  1.586e-08  17.021  < 2e-16 ***
## floors        -1.338e-02  1.622e-03  -8.249  < 2e-16 ***
## waterfront     1.858e-01  7.548e-03  24.612  < 2e-16 ***
## view           2.403e-02  9.029e-04  26.611  < 2e-16 ***
## condition      2.332e-02  9.861e-04  23.654  < 2e-16 ***
## grade          3.447e-02  9.221e-04  37.378  < 2e-16 ***
## sqft_above     8.153e-05  1.486e-06  54.848  < 2e-16 ***
## sqft_basement  5.129e-05  1.743e-06  29.421  < 2e-16 ***
## zipcode98002  -1.044e-02  7.101e-03  -1.470 0.141489    
## zipcode98003   6.801e-03  6.546e-03   1.039 0.298797    
## zipcode98004   4.189e-01  6.369e-03  65.765  < 2e-16 ***
## zipcode98005   2.648e-01  7.577e-03  34.944  < 2e-16 ***
## zipcode98006   2.261e-01  5.753e-03  39.305  < 2e-16 ***
## zipcode98007   2.367e-01  8.144e-03  29.063  < 2e-16 ***
## zipcode98008   2.375e-01  6.528e-03  36.380  < 2e-16 ***
## zipcode98010   8.926e-02  9.208e-03   9.693  < 2e-16 ***
## zipcode98011   1.615e-01  7.332e-03  22.031  < 2e-16 ***
## zipcode98014   1.207e-01  8.765e-03  13.775  < 2e-16 ***
## zipcode98019   1.203e-01  7.214e-03  16.680  < 2e-16 ***
## zipcode98022   9.217e-03  6.956e-03   1.325 0.185183    
## zipcode98023  -1.211e-02  5.632e-03  -2.150 0.031572 *  
## zipcode98024   1.431e-01  1.029e-02  13.911  < 2e-16 ***
## zipcode98027   1.906e-01  5.876e-03  32.430  < 2e-16 ***
## zipcode98028   1.518e-01  6.609e-03  22.966  < 2e-16 ***
## zipcode98029   2.202e-01  6.245e-03  35.252  < 2e-16 ***
## zipcode98030   1.498e-02  6.642e-03   2.255 0.024170 *  
## zipcode98031   2.342e-02  6.481e-03   3.614 0.000303 ***
## zipcode98032  -7.956e-03  8.573e-03  -0.928 0.353423    
## zipcode98033   2.877e-01  5.859e-03  49.113  < 2e-16 ***
## zipcode98034   1.993e-01  5.532e-03  36.022  < 2e-16 ***
## zipcode98038   5.940e-02  5.395e-03  11.009  < 2e-16 ***
## zipcode98039   4.915e-01  1.261e-02  38.960  < 2e-16 ***
## zipcode98040   3.181e-01  6.641e-03  47.894  < 2e-16 ***
## zipcode98042   1.736e-02  5.489e-03   3.162 0.001572 ** 
## zipcode98045   1.201e-01  6.957e-03  17.260  < 2e-16 ***
## zipcode98052   2.337e-01  5.422e-03  43.105  < 2e-16 ***
## zipcode98053   2.076e-01  6.001e-03  34.600  < 2e-16 ***
## zipcode98055   4.816e-02  6.530e-03   7.375 1.72e-13 ***
## zipcode98056   1.120e-01  5.830e-03  19.217  < 2e-16 ***
## zipcode98058   5.225e-02  5.740e-03   9.104  < 2e-16 ***
## zipcode98059   1.154e-01  5.681e-03  20.320  < 2e-16 ***
## zipcode98065   1.380e-01  6.278e-03  21.989  < 2e-16 ***
## zipcode98070   1.076e-01  8.824e-03  12.199  < 2e-16 ***
## zipcode98072   1.776e-01  6.621e-03  26.819  < 2e-16 ***
## zipcode98074   2.022e-01  5.799e-03  34.869  < 2e-16 ***
## zipcode98075   1.945e-01  6.134e-03  31.716  < 2e-16 ***
## zipcode98077   1.558e-01  7.450e-03  20.913  < 2e-16 ***
## zipcode98092   1.136e-03  6.018e-03   0.189 0.850332    
## zipcode98102   3.579e-01  8.756e-03  40.871  < 2e-16 ***
## zipcode98103   2.994e-01  5.593e-03  53.528  < 2e-16 ***
## zipcode98105   3.532e-01  7.078e-03  49.905  < 2e-16 ***
## zipcode98106   1.114e-01  6.128e-03  18.177  < 2e-16 ***
## zipcode98107   3.054e-01  6.709e-03  45.522  < 2e-16 ***
## zipcode98108   1.359e-01  7.236e-03  18.781  < 2e-16 ***
## zipcode98109   3.696e-01  8.954e-03  41.278  < 2e-16 ***
## zipcode98112   3.994e-01  6.829e-03  58.486  < 2e-16 ***
## zipcode98115   3.004e-01  5.579e-03  53.848  < 2e-16 ***
## zipcode98116   2.739e-01  6.205e-03  44.148  < 2e-16 ***
## zipcode98117   2.968e-01  5.626e-03  52.751  < 2e-16 ***
## zipcode98118   1.627e-01  5.672e-03  28.690  < 2e-16 ***
## zipcode98119   3.695e-01  7.340e-03  50.345  < 2e-16 ***
## zipcode98122   2.916e-01  6.532e-03  44.643  < 2e-16 ***
## zipcode98125   2.085e-01  5.923e-03  35.212  < 2e-16 ***
## zipcode98126   1.942e-01  6.203e-03  31.315  < 2e-16 ***
## zipcode98133   1.659e-01  5.647e-03  29.386  < 2e-16 ***
## zipcode98136   2.464e-01  6.736e-03  36.582  < 2e-16 ***
## zipcode98144   2.398e-01  6.187e-03  38.760  < 2e-16 ***
## zipcode98146   9.792e-02  6.547e-03  14.957  < 2e-16 ***
## zipcode98148   5.229e-02  1.079e-02   4.844 1.28e-06 ***
## zipcode98155   1.520e-01  5.763e-03  26.381  < 2e-16 ***
## zipcode98166   1.040e-01  6.583e-03  15.796  < 2e-16 ***
## zipcode98168   3.278e-02  6.659e-03   4.922 8.64e-07 ***
## zipcode98177   2.193e-01  6.763e-03  32.420  < 2e-16 ***
## zipcode98178   5.086e-02  6.691e-03   7.601 3.11e-14 ***
## zipcode98188   2.898e-02  8.388e-03   3.455 0.000552 ***
## zipcode98198   2.378e-02  6.631e-03   3.587 0.000336 ***
## zipcode98199   3.134e-01  6.324e-03  49.556  < 2e-16 ***
## sqft_living15  3.225e-05  1.485e-06  21.712  < 2e-16 ***
## year           4.112e-02  3.787e-03  10.857  < 2e-16 ***
## month02        5.542e-03  3.506e-03   1.581 0.113942    
## month03        1.839e-02  3.223e-03   5.708 1.17e-08 ***
## month04        2.503e-02  3.143e-03   7.964 1.78e-15 ***
## month05        3.279e-02  4.158e-03   7.885 3.35e-15 ***
## month06        4.043e-02  4.921e-03   8.216 2.27e-16 ***
## month07        3.763e-02  4.919e-03   7.650 2.14e-14 ***
## month08        3.869e-02  4.965e-03   7.793 6.97e-15 ***
## month09        3.935e-02  4.999e-03   7.873 3.70e-15 ***
## month10        3.563e-02  4.981e-03   7.152 8.95e-13 ***
## month11        3.355e-02  5.104e-03   6.573 5.10e-11 ***
## month12        3.769e-02  5.071e-03   7.432 1.12e-13 ***
## renovated      4.067e-02  2.966e-03  13.710  < 2e-16 ***
## age            1.209e-04  3.316e-05   3.647 0.000266 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06866 on 15035 degrees of freedom
## Multiple R-squared:  0.8847, Adjusted R-squared:  0.8839 
## F-statistic:  1240 on 93 and 15035 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(sw_model)

library(olsrr)
library(datasets)

k <- ols_step_forward_aic(house_lm1)
k
## 
##                             Selection Summary                             
## -------------------------------------------------------------------------
## Variable            AIC        Sum Sq       RSS       R-Sq      Adj. R-Sq 
## -------------------------------------------------------------------------
## zipcode          -16664.492    322.903    291.667    0.52541      0.52324 
## sqft_above       -28057.410    477.234    137.336    0.77653      0.77549 
## sqft_basement    -32212.160    510.228    104.342    0.83022      0.82942 
## view             -34072.496    522.313     92.257    0.84988      0.84917 
## grade            -35461.048    530.415     84.155    0.86307      0.86240 
## condition        -36021.789    533.487     81.083    0.86807      0.86742 
## waterfront       -36542.798    536.243     78.327    0.87255      0.87191 
## sqft_living15    -36991.480    538.541     76.029    0.87629      0.87567 
## sqft_lot         -37306.388    540.117     74.453    0.87885      0.87823 
## year             -37582.377    541.473     73.097    0.88106      0.88044 
## renovated        -37762.385    542.347     72.223    0.88248      0.88187 
## month            -37863.205    542.931     71.639    0.88343      0.88273 
## floors           -37910.579    543.165     71.406    0.88381      0.88311 
## bathrooms        -38006.177    543.624     70.946    0.88456      0.88385 
## age              -38017.555    543.686     70.884    0.88466      0.88395 
## -------------------------------------------------------------------------
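The AIC column above is produced by `extractAIC()`, which for a linear model equals n·log(RSS/n) + 2·edf up to an additive constant (the earlier `step()` call substituted k = log(n), a BIC-style penalty, for the 2). A small check on simulated data:

```r
set.seed(1023)
x   <- rnorm(100)
y   <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

n   <- length(y)
rss <- sum(residuals(fit)^2)
edf <- n - fit$df.residual          # number of fitted coefficients
aic_manual <- n * log(rss / n) + 2 * edf

aic_manual - extractAIC(fit)[2]     # essentially zero
```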

Analysis:

The WLS regression model makes the error variances more equally distributed throughout. The adjusted R-squared of 0.887 is essentially unchanged (weighted R-squared values are not directly comparable to their OLS counterparts), with almost all of the independent variables being significant.

Graph 1 (Residuals vs Fitted): (linearity assumption) A linear relationship seems appropriate. (The average mean error equal to zero assumption) The average mean seems to be equal to zero.

Graph 2 (Q-Q Residuals): (normality assumption) There is minor departure from a normal distribution at the tails.

Graph 3 (Scale-Location): (constant variance assumption) The residuals seem to be more equally distributed throughout compared to the stepwise model. (homoscedasticity assumption) The variances appear to be slightly more constant.

Graph 4 (Residuals vs Leverage): There now appears to be several influential outliers beyond the Cook’s distance line.

# Weighted Least Squares Regression
# We notice from the scale-location graph of the main model that the error variances are unequal.

# Calculating the weights for the model
ei <- house_lm$residuals
abs.ei <- abs(ei)
g1 <- lm(abs.ei ~ train.dat$price)
summary(g1)
## 
## Call:
## lm(formula = abs.ei ~ train.dat$price)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -394181  -53388  -11803   40250 2828319 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.859e+04  1.482e+03  -12.54   <2e-16 ***
## train.dat$price  2.113e-01  2.284e-03   92.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101600 on 15127 degrees of freedom
## Multiple R-squared:  0.3613, Adjusted R-squared:  0.3613 
## F-statistic:  8559 on 1 and 15127 DF,  p-value: < 2.2e-16
s <- g1$fitted.values
wi = 1/(s^2)

# Weighted-least squares regression
house_lm_wls <- lm(price ~ ., weights = wi, data = train.dat)

summary(house_lm_wls)
## 
## Call:
## lm(formula = price ~ ., data = train.dat, weights = wi)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.9411  -0.1514   0.5370   1.2559  19.6943 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.002e+07  6.586e+06 -10.632  < 2e-16 ***
## bedrooms      -5.524e+03  8.040e+02  -6.870 6.66e-12 ***
## bathrooms      1.744e+04  1.527e+03  11.416  < 2e-16 ***
## sqft_lot       2.234e-01  2.075e-02  10.766  < 2e-16 ***
## floors        -1.038e+04  1.756e+03  -5.914 3.42e-09 ***
## waterfront     1.836e+05  2.378e+04   7.718 1.26e-14 ***
## view           3.399e+04  1.868e+03  18.195  < 2e-16 ***
## condition      2.081e+04  7.381e+02  28.198  < 2e-16 ***
## grade          1.950e+04  8.201e+02  23.784  < 2e-16 ***
## sqft_above     1.152e+02  2.212e+00  52.101  < 2e-16 ***
## sqft_basement  9.369e+01  2.576e+00  36.366  < 2e-16 ***
## zipcode98002   1.475e+04  3.620e+03   4.074 4.64e-05 ***
## zipcode98003   2.024e+04  4.425e+03   4.574 4.82e-06 ***
## zipcode98004   6.051e+05  2.009e+04  30.113  < 2e-16 ***
## zipcode98005   3.483e+05  1.843e+04  18.897  < 2e-16 ***
## zipcode98006   2.536e+05  9.787e+03  25.915  < 2e-16 ***
## zipcode98007   2.546e+05  1.468e+04  17.346  < 2e-16 ***
## zipcode98008   2.354e+05  1.022e+04  23.035  < 2e-16 ***
## zipcode98010   5.922e+04  7.661e+03   7.729 1.15e-14 ***
## zipcode98011   1.590e+05  1.027e+04  15.481  < 2e-16 ***
## zipcode98014   1.181e+05  3.931e+03  30.033  < 2e-16 ***
## zipcode98019   1.001e+05  7.979e+03  12.551  < 2e-16 ***
## zipcode98022   4.197e+04  4.130e+03  10.164  < 2e-16 ***
## zipcode98023  -4.395e+03  2.855e+03  -1.539 0.123766    
## zipcode98024   1.402e+05  1.263e+04  11.104  < 2e-16 ***
## zipcode98027   1.793e+05  7.953e+03  22.542  < 2e-16 ***
## zipcode98028   1.169e+05  7.615e+03  15.349  < 2e-16 ***
## zipcode98029   2.372e+05  1.000e+04  23.719  < 2e-16 ***
## zipcode98030   1.193e+04  4.471e+03   2.668 0.007638 ** 
## zipcode98031   2.767e+04  4.830e+03   5.729 1.03e-08 ***
## zipcode98032   1.061e+04  2.754e+03   3.853 0.000117 ***
## zipcode98033   2.420e+05  8.830e+03  27.405  < 2e-16 ***
## zipcode98034   6.404e+03  2.872e+03   2.229 0.025800 *  
## zipcode98038   4.317e+04  4.036e+03  10.697  < 2e-16 ***
## zipcode98039   9.689e+05  7.579e+04  12.784  < 2e-16 ***
## zipcode98040   4.990e+05  2.007e+04  24.867  < 2e-16 ***
## zipcode98042   2.106e+04  3.304e+03   6.374 1.90e-10 ***
## zipcode98045   1.030e+05  6.825e+03  15.096  < 2e-16 ***
## zipcode98052   2.531e+05  7.848e+03  32.253  < 2e-16 ***
## zipcode98053   1.963e+05  8.789e+03  22.330  < 2e-16 ***
## zipcode98055   3.905e+04  3.571e+03  10.934  < 2e-16 ***
## zipcode98056   8.582e+04  4.649e+03  18.460  < 2e-16 ***
## zipcode98058   1.838e+04  3.089e+03   5.952 2.71e-09 ***
## zipcode98059   9.020e+04  5.657e+03  15.945  < 2e-16 ***
## zipcode98065   1.299e+05  7.603e+03  17.087  < 2e-16 ***
## zipcode98070   8.159e+04  9.831e+03   8.299  < 2e-16 ***
## zipcode98072   1.749e+05  9.608e+03  18.199  < 2e-16 ***
## zipcode98074   2.349e+05  9.332e+03  25.174  < 2e-16 ***
## zipcode98075   2.725e+05  1.262e+04  21.590  < 2e-16 ***
## zipcode98077   1.735e+05  1.244e+04  13.944  < 2e-16 ***
## zipcode98092  -4.278e+02  3.236e+03  -0.132 0.894823    
## zipcode98102   3.697e+05  1.905e+04  19.404  < 2e-16 ***
## zipcode98103   2.612e+05  6.669e+03  39.166  < 2e-16 ***
## zipcode98105   3.937e+05  1.482e+04  26.560  < 2e-16 ***
## zipcode98106   6.279e+04  3.368e+03  18.645  < 2e-16 ***
## zipcode98107   3.117e+05  1.037e+04  30.043  < 2e-16 ***
## zipcode98108   3.479e+04  3.356e+03  10.367  < 2e-16 ***
## zipcode98109   2.755e+05  1.815e+04  15.183  < 2e-16 ***
## zipcode98112   3.238e+05  1.421e+04  22.779  < 2e-16 ***
## zipcode98115   2.904e+05  7.360e+03  39.456  < 2e-16 ***
## zipcode98116   1.926e+05  7.695e+03  25.030  < 2e-16 ***
## zipcode98117   2.344e+05  6.479e+03  36.183  < 2e-16 ***
## zipcode98118   1.025e+05  4.243e+03  24.160  < 2e-16 ***
## zipcode98119   3.977e+05  1.596e+04  24.913  < 2e-16 ***
## zipcode98122   2.650e+05  9.277e+03  28.564  < 2e-16 ***
## zipcode98125   1.844e+05  6.452e+03  28.589  < 2e-16 ***
## zipcode98126   1.318e+05  4.700e+03  28.042  < 2e-16 ***
## zipcode98133   1.419e+05  4.813e+03  29.488  < 2e-16 ***
## zipcode98136   2.228e+05  8.923e+03  24.970  < 2e-16 ***
## zipcode98144   1.908e+05  7.076e+03  26.960  < 2e-16 ***
## zipcode98146   3.571e+04  3.039e+03  11.752  < 2e-16 ***
## zipcode98148   5.756e+04  5.009e+03  11.490  < 2e-16 ***
## zipcode98155   1.350e+05  5.154e+03  26.187  < 2e-16 ***
## zipcode98166   4.154e+04  4.033e+03  10.300  < 2e-16 ***
## zipcode98168   3.642e+04  2.883e+03  12.633  < 2e-16 ***
## zipcode98177   2.119e+05  1.011e+04  20.967  < 2e-16 ***
## zipcode98178   6.413e+04  2.828e+03  22.674  < 2e-16 ***
## zipcode98188   3.690e+04  5.460e+03   6.759 1.45e-11 ***
## zipcode98198   1.801e+04  4.034e+03   4.464 8.11e-06 ***
## zipcode98199   3.331e+05  1.127e+04  29.569  < 2e-16 ***
## sqft_living15  1.665e+01  1.659e+00  10.034  < 2e-16 ***
## year           3.466e+04  3.268e+03  10.604  < 2e-16 ***
## month02        2.464e+03  3.198e+03   0.771 0.440945    
## month03       -4.828e+03  2.581e+03  -1.871 0.061420 .  
## month04        1.900e+04  3.006e+03   6.321 2.67e-10 ***
## month05        9.925e+03  3.659e+03   2.712 0.006689 ** 
## month06        3.347e+04  4.108e+03   8.148 3.99e-16 ***
## month07        2.989e+04  4.280e+03   6.983 3.00e-12 ***
## month08        2.761e+04  4.414e+03   6.255 4.09e-10 ***
## month09        2.030e+04  4.059e+03   5.002 5.75e-07 ***
## month10        1.699e+04  4.264e+03   3.986 6.75e-05 ***
## month11        1.348e+04  4.309e+03   3.129 0.001760 ** 
## month12        2.668e+04  4.535e+03   5.883 4.11e-09 ***
## renovated      2.030e+04  3.856e+03   5.264 1.43e-07 ***
## age           -1.863e+02  3.198e+01  -5.825 5.83e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.534 on 15034 degrees of freedom
## Multiple R-squared:  0.8872, Adjusted R-squared:  0.8865 
## F-statistic:  1258 on 94 and 15034 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(house_lm_wls)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

# Append to results.df

# WLS train/test results
wls_train_pred = predict(house_lm_wls, newdata = train.dat)
wls_test_pred = predict(house_lm_wls, newdata = test.dat)

wls_train_results = postResample(pred = wls_train_pred, obs = train.dat$price)
wls_test_results = postResample(pred = wls_test_pred, obs = test.dat$price)
wls_test_sse = sum((wls_test_pred - test.dat$price)^2)

# Append to the Results Data Frame
results.df = rbind(results.df,data.frame(model = "Weighted Least Squared Regression",
            R.Squared.Train = unname(wls_train_results[2]),
            R.Squared.Test = unname(wls_test_results[2]),
            RMSE.test = unname(wls_test_results[1]),
            SSE.test = wls_test_sse)
)

Analysis: Before performing the Ridge, Lasso, and ElasticNet regression methods, we assess the multicollinearity in the model. Having handled the highly correlated variables during data preparation, the training data shows no serious multicollinearity: the adjusted values GVIF^(1/(2*Df)) are all comfortably below the usual cutoff of about 3.2 (equivalent to a VIF of 10), with only sqft_above approaching a raw GVIF of 5.

# Testing the main (boxcox) model for multicollinearity
vif(house_lm1)
##                    GVIF Df GVIF^(1/(2*Df))
## bedrooms       1.673078  1        1.293475
## bathrooms      3.367212  1        1.834996
## sqft_lot       1.235053  1        1.111330
## floors         2.456560  1        1.567342
## waterfront     1.238314  1        1.112796
## view           1.502095  1        1.225600
## condition      1.334053  1        1.155012
## grade          3.800679  1        1.949533
## sqft_above     5.240046  1        2.289115
## sqft_basement  2.062535  1        1.436153
## zipcode        5.471579 69        1.012392
## sqft_living15  3.355865  1        1.831902
## year          10.074833  1        3.174088
## month         10.668717 11        1.113608
## renovated      1.152531  1        1.073560
## age            2.959803  1        1.720408
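A common rule of thumb flags a VIF above 5 or 10; for multi-df terms such as zipcode and month, GVIF^(1/(2*Df)) is the comparable quantity, to be compared against the square roots of those cutoffs (about 2.2 and 3.2). The underlying computation, VIF_j = 1/(1 − R²_j), is sketched below on simulated predictors (illustrative data, not the housing set):

```r
set.seed(1023)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # strongly correlated with x1
x3 <- rnorm(n)                        # independent of the others

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on the remaining predictors
vif_manual <- function(j, X) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
}

X <- cbind(x1, x2, x3)
round(sapply(seq_len(ncol(X)), vif_manual, X = X), 2)
```

The correlated pair (x1, x2) yields inflated VIFs while the independent x3 stays near 1, mirroring how `car::vif()` reads on the housing model.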

Analysis: The Ridge regression selects a best lambda of 25,391.812. The model performs reasonably well with a training R-squared of 0.70, though the mean squared error remains high at roughly 3.9e+10.

set.seed(1023)
n<-dim(df_house_original)[1]
IND<-sample(c(1:n),round(n*0.7))
train.dat.o<-df_house_original[IND,]
test.dat.o<-df_house_original[-c(IND),]

dim(train.dat.o)
## [1] 15129    24
dim(test.dat.o)
## [1] 6484   24
# Ridge Regression

# Extract 'x' and 'y'
x <- data.matrix(dplyr::select(train.dat.o, -price))
y <- train.dat.o$price  # response taken from the same split as 'x'

# Perform ridge regression
house_lm_ridge <- glmnet::cv.glmnet(x, y, alpha = 0, nlambda = 100, 
                                    lambda.min.ratio = 0.0001)
best.lambda.ridge <- house_lm_ridge$lambda.min
plot(house_lm_ridge)

print(paste0("Ridge best lambda of ", round(best.lambda.ridge, digits = 3)))
## [1] "Ridge best lambda of 25391.812"
# Generating the results of the model
price.predictors.train <- colnames(dplyr::select(train.dat.o, -price))

ridge_results <- data.frame(
  price.train = train.dat.o$price,
  price.ridge.train = predict(
    house_lm_ridge, s = best.lambda.ridge, 
    newx = data.matrix(train.dat.o[price.predictors.train]))
)

calc_metrics <- function(actual, predicted) {
  sse <- sum((actual - predicted) ^ 2)
  mse <- sse / length(actual)
  rmse <- sqrt(mse) # Calculate RMSE
  sst <- sum((actual - mean(actual)) ^ 2)
  r2 <- 1 - sse / sst
  return(c(SST = sst, SSE = sse, MSE = mse, RMSE = rmse, R2 = r2))
}

# Apply the metrics function to each set of predictions
ridge_metrics <- data.frame(
  Model = c("Ridge"),
  do.call(rbind, lapply(
    2:ncol(ridge_results), 
    function(i) calc_metrics(ridge_results$price.train, ridge_results[,i])))
)

# Display the metrics table with RMSE
ridge_metrics %>%
  dplyr::arrange(desc(R2)) %>%
  knitr::kable(caption = "SST, SSE, MSE, RMSE, and R2 of the Model")
Table: SST, SSE, MSE, RMSE, and R2 of the Model

|Model |          SST|         SSE|         MSE|   RMSE|        R2|
|:-----|------------:|-----------:|-----------:|------:|---------:|
|Ridge | 1.978906e+15| 5.91172e+14| 39075416211| 197675| 0.7012632|
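As a quick sanity check, calc_metrics() can be exercised on a toy vector where the answer is computable by hand (the function is repeated so the snippet is self-contained):

```r
calc_metrics <- function(actual, predicted) {
  sse  <- sum((actual - predicted)^2)
  mse  <- sse / length(actual)
  rmse <- sqrt(mse)
  sst  <- sum((actual - mean(actual))^2)
  r2   <- 1 - sse / sst
  c(SST = sst, SSE = sse, MSE = mse, RMSE = rmse, R2 = r2)
}

# SSE = 3 * 10^2 = 300; SST = 2 * 100^2 = 20000;
# MSE = 100, RMSE = 10, R2 = 1 - 300/20000 = 0.985
calc_metrics(actual = c(100, 200, 300), predicted = c(110, 190, 310))
```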
# Append to results.df
ridge_train_pred = predict(house_lm_ridge, 
                           s = best.lambda.ridge, 
                           newx = data.matrix(train.dat.o[price.predictors.train]))
ridge_test_pred = predict(house_lm_ridge,
                          s = best.lambda.ridge,
                          newx = data.matrix(test.dat.o[price.predictors.train]))

ridge_train_results = postResample(pred = ridge_train_pred, obs = train.dat.o$price)
ridge_test_results = postResample(pred = ridge_test_pred, obs = test.dat.o$price)
ridge_test_sse = sum((ridge_test_pred - test.dat.o$price)^2)

# Append to the Results Data Frame
results.df = rbind(results.df,data.frame(model = "Ridge Regression",
            R.Squared.Train = unname(ridge_train_results[2]),
            R.Squared.Test = unname(ridge_test_results[2]),
            RMSE.test = unname(ridge_test_results[1]),
            SSE.test = ridge_test_sse)
)

Analysis: The Lasso regression selects a best lambda of 343.563. The training R-squared of 0.70 is on par with Ridge. Of the three penalized models used to address multicollinearity, this one had the lowest training RMSE at 197,178.5.

# Lasso Regression
house_lm_lasso <- glmnet::cv.glmnet(x, y, alpha = 1, nlambda = 100, lambda.min.ratio = 0.0001)
best.lambda.lasso <- house_lm_lasso$lambda.min
plot(house_lm_lasso)

print(paste0("Lasso best lambda of ", round(best.lambda.lasso, digits = 3)))
## [1] "Lasso best lambda of 343.563"
# Generating the results of the model
lasso_results <- data.frame(
  price.train = train.dat.o$price,
  price.lasso.train = predict(house_lm_lasso, s = best.lambda.lasso, newx = data.matrix(train.dat.o[price.predictors.train]))
)

# Reuse calc_metrics() defined above for the Ridge model

# Apply the metrics function to each set of predictions
lasso_metrics <- data.frame(
  Model = c("Lasso"),
  do.call(rbind, lapply(2:ncol(lasso_results), function(i) calc_metrics(lasso_results$price.train, lasso_results[,i])))
)

# Display the metrics table with RMSE
lasso_metrics %>%
  dplyr::arrange(desc(R2)) %>%
  knitr::kable(caption = "SST, SSE, MSE, RMSE, and R2 of the Model")
Table: SST, SSE, MSE, RMSE, and R2 of the Model

|Model |          SST|          SSE|         MSE|     RMSE|        R2|
|:-----|------------:|------------:|------------:|--------:|---------:|
|Lasso | 1.978906e+15| 5.882058e+14| 38879354877| 197178.5| 0.7027622|
# Append to results.df

lasso_train_pred = predict(house_lm_lasso, 
                           s = best.lambda.lasso, 
                           newx = data.matrix(train.dat.o[price.predictors.train]))
lasso_test_pred = predict(house_lm_lasso, 
                          s = best.lambda.lasso, 
                          newx = data.matrix(test.dat.o[price.predictors.train]))

lasso_train_results = postResample(pred = lasso_train_pred, obs = train.dat.o$price)
lasso_test_results = postResample(pred = lasso_test_pred, obs = test.dat.o$price)
lasso_test_sse = sum((lasso_test_pred - test.dat.o$price)^2)

# Append to the Results Data Frame
results.df = rbind(results.df,data.frame(model = "Lasso Regression",
            R.Squared.Train = unname(lasso_train_results[2]),
            R.Squared.Test = unname(lasso_test_results[2]),
            RMSE.test = unname(lasso_test_results[1]),
            SSE.test = lasso_test_sse)
)

Analysis: The ElasticNet regression selects a best lambda of 687.127. Similar to the Ridge and Lasso regression models, its r-squared value is roughly 0.70, and again the MSE and RMSE are quite high. This form of regularization may help the model generalize better to new data.
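
For reference, the objective that `glmnet` minimizes for a Gaussian response blends the two penalties through `alpha` (this is the form documented for glmnet): `alpha = 0` recovers Ridge, `alpha = 1` recovers Lasso, and `alpha = 0.5`, as used below, weights them equally.

```latex
\min_{\beta_0,\,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
  + \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
```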

# Elastic Net Regression
house_lm_enet <- glmnet::cv.glmnet(x, y, alpha = 0.5, nlambda = 100, lambda.min.ratio = 0.0001)
plot(house_lm_enet)

best.lambda.enet <- house_lm_enet$lambda.min

print(paste0("ElasticNet best lambda of ", round(best.lambda.enet, digits = 3)))
## [1] "ElasticNet best lambda of 687.127"
# Generating the results of the model
enet_results <- data.frame(
  price.train = train.dat.o$price,
  price.enet.train = predict(house_lm_enet, s = best.lambda.enet, newx = data.matrix(train.dat.o[price.predictors.train]))
)

calc_metrics <- function(actual, predicted) {
  sse <- sum((actual - predicted) ^ 2)
  mse <- sse / length(actual)
  rmse <- sqrt(mse) # Calculate RMSE
  sst <- sum((actual - mean(actual)) ^ 2)
  r2 <- 1 - sse / sst
  return(c(SST = sst, SSE = sse, MSE = mse, RMSE = rmse, R2 = r2))
}

# Apply the metrics function to each set of predictions
enet_metrics <- data.frame(
  Model = c("ElasticNet"),
  do.call(rbind, lapply(2:ncol(enet_results), function(i) calc_metrics(enet_results$price.train, enet_results[,i])))
)

# Display the metrics table with RMSE
enet_metrics %>%
  dplyr::arrange(desc(R2)) %>%
  knitr::kable(caption = "SST, SSE, MSE, RMSE, and R2 of the Model")
Table: SST, SSE, MSE, RMSE, and R2 of the Model

| Model      | SST          | SSE          | MSE         | RMSE     | R2        |
|------------|--------------|--------------|-------------|----------|-----------|
| ElasticNet | 1.978906e+15 | 5.882044e+14 | 38879264618 | 197178.3 | 0.7027629 |
# Append to results.df

enet_train_pred = predict(house_lm_enet,
                          s = best.lambda.enet,
                          newx = data.matrix(train.dat.o[price.predictors.train]))
enet_test_pred = predict(house_lm_enet,
                         s = best.lambda.enet,
                         newx = data.matrix(test.dat.o[price.predictors.train]))

enet_train_results = postResample(pred = enet_train_pred, obs = train.dat.o$price)
enet_test_results = postResample(pred = enet_test_pred, obs = test.dat.o$price)
enet_test_sse = sum((enet_test_pred - test.dat.o$price)^2)

# Append to the Results Data Frame
results.df = rbind(results.df,data.frame(model = "Elastic Net Regression",
            R.Squared.Train = unname(enet_train_results[2]),
            R.Squared.Test = unname(enet_test_results[2]),
            RMSE.test = unname(enet_test_results[1]),
            SSE.test = enet_test_sse)
)

Analysis: Before fitting the robust regression models with Huber and bisquare weights, we assess the severity of the outliers.

The Cook's D chart shows two observations that could exert substantial influence on the model.

The Outlier and Leverage Diagnostics plot for price^lambda flags numerous observations as leverage points, outliers, or both. These values could have a large effect on the model's fit, so the model's results may be less reliable to interpret.

The Deleted Studentized Residual vs Predicted Values plot shows a considerable portion of the data falling outside the thresholds.

# [Ibrahim]
# Looking at the residuals using the Cook's distance chart
ols_plot_cooksd_chart(house_lm1)

# Studentized Residuals vs Leverage Plot
ols_plot_resid_lev(house_lm1)

# Deleted Studentized Residuals vs Fitted Values Plot
ols_plot_resid_stud_fit(house_lm1)

Analysis: The Robust Regression model with Huber weights yields a residual standard error of 77,830. Huber weights only downweight severe outliers rather than excluding them, and the plots above indicate such outliers are present. After the robust fit, however, the Residuals vs Leverage plot shows no influential outliers, which makes the interpretation of the model's results more reliable.

# Robust Regression using Huber weights 
# There are influential points (see plots above)
house_lm_huber <- MASS::rlm(price ~ ., psi = psi.huber, data = train.dat)
summary(house_lm_huber)
## 
## Call: rlm(formula = price ~ ., data = train.dat, psi = psi.huber)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -906767.0  -52617.8     521.7   52381.7 5162533.2 
## 
## Coefficients:
##               Value         Std. Error    t value      
## (Intercept)   -9.142345e+07  9.685652e+06 -9.439100e+00
## bedrooms      -1.036066e+04  9.750903e+02 -1.062530e+01
## bathrooms      1.598913e+04  1.688245e+03  9.470900e+00
## sqft_lot       3.175000e-01  2.020000e-02  1.573940e+01
## floors        -2.487430e+04  2.057818e+03 -1.208770e+01
## waterfront     5.263568e+05  9.586433e+03  5.490640e+01
## view           4.553335e+04  1.148339e+03  3.965150e+01
## condition      2.504671e+04  1.251488e+03  2.001350e+01
## grade          4.749743e+04  1.177707e+03  4.033040e+01
## sqft_above     1.404457e+02  1.969300e+00  7.131860e+01
## sqft_basement  8.514710e+01  2.294300e+00  3.711270e+01
## zipcode98002   1.648089e+04  9.010934e+03  1.829000e+00
## zipcode98003  -3.704417e+03  8.306187e+03 -4.460000e-01
## zipcode98004   7.096101e+05  8.083154e+03  8.778880e+01
## zipcode98005   3.193243e+05  9.615463e+03  3.320950e+01
## zipcode98006   2.746299e+05  7.300935e+03  3.761570e+01
## zipcode98007   2.492407e+05  1.033620e+04  2.411340e+01
## zipcode98008   2.425199e+05  8.286615e+03  2.926650e+01
## zipcode98010   6.121594e+04  1.168614e+04  5.238300e+00
## zipcode98011   1.394875e+05  9.304570e+03  1.499130e+01
## zipcode98014   1.251285e+05  1.113224e+04  1.124020e+01
## zipcode98019   9.739583e+04  9.155996e+03  1.063740e+01
## zipcode98022  -6.307204e+03  8.828249e+03 -7.144000e-01
## zipcode98023  -1.709710e+04  7.147175e+03 -2.392100e+00
## zipcode98024   1.418280e+05  1.305863e+04  1.086090e+01
## zipcode98027   1.952509e+05  7.458891e+03  2.617690e+01
## zipcode98028   1.304999e+05  8.387064e+03  1.555970e+01
## zipcode98029   2.216358e+05  7.926921e+03  2.795990e+01
## zipcode98030   4.833502e+03  8.429035e+03  5.734000e-01
## zipcode98031   1.124923e+04  8.223943e+03  1.367900e+00
## zipcode98032  -2.500411e+03  1.087951e+04 -2.298000e-01
## zipcode98033   3.255407e+05  7.435745e+03  4.378050e+01
## zipcode98034   1.918027e+05  7.020225e+03  2.732140e+01
## zipcode98038   3.741830e+04  6.847773e+03  5.464300e+00
## zipcode98039   1.088986e+06  1.601374e+04  6.800320e+01
## zipcode98040   4.825008e+05  8.427399e+03  5.725380e+01
## zipcode98042   6.782856e+03  6.966151e+03  9.737000e-01
## zipcode98045   1.035766e+05  8.829111e+03  1.173130e+01
## zipcode98052   2.481044e+05  6.880090e+03  3.606120e+01
## zipcode98053   2.237339e+05  7.627845e+03  2.933120e+01
## zipcode98055   4.541222e+04  8.288378e+03  5.479000e+00
## zipcode98056   1.008883e+05  7.398864e+03  1.363560e+01
## zipcode98058   3.128997e+04  7.283497e+03  4.296000e+00
## zipcode98059   9.100609e+04  7.209482e+03  1.262310e+01
## zipcode98065   1.107907e+05  7.971981e+03  1.389750e+01
## zipcode98070   6.602524e+04  1.120289e+04  5.893600e+00
## zipcode98072   1.668108e+05  8.404266e+03  1.984830e+01
## zipcode98074   2.055064e+05  7.359151e+03  2.792530e+01
## zipcode98075   2.013117e+05  7.783830e+03  2.586280e+01
## zipcode98077   1.547341e+05  9.456414e+03  1.636290e+01
## zipcode98092  -2.833064e+04  7.637296e+03 -3.709500e+00
## zipcode98102   4.322050e+05  1.112157e+04  3.886190e+01
## zipcode98103   3.203349e+05  7.102052e+03  4.510460e+01
## zipcode98105   4.182605e+05  8.981836e+03  4.656740e+01
## zipcode98106   1.281162e+05  7.776655e+03  1.647450e+01
## zipcode98107   3.239683e+05  8.521026e+03  3.801990e+01
## zipcode98108   1.311999e+05  9.183147e+03  1.428700e+01
## zipcode98109   4.472080e+05  1.136883e+04  3.933630e+01
## zipcode98112   5.582963e+05  8.673731e+03  6.436630e+01
## zipcode98115   3.243484e+05  7.083246e+03  4.579090e+01
## zipcode98116   2.970097e+05  7.881334e+03  3.768520e+01
## zipcode98117   3.171497e+05  7.145593e+03  4.438400e+01
## zipcode98118   1.654538e+05  7.200299e+03  2.297870e+01
## zipcode98119   4.453525e+05  9.318181e+03  4.779390e+01
## zipcode98122   3.112322e+05  8.292870e+03  3.753010e+01
## zipcode98125   2.024326e+05  7.516078e+03  2.693330e+01
## zipcode98126   1.944648e+05  7.880222e+03  2.467760e+01
## zipcode98133   1.617212e+05  7.166751e+03  2.256550e+01
## zipcode98136   2.526864e+05  8.554514e+03  2.953840e+01
## zipcode98144   2.485439e+05  7.857901e+03  3.162980e+01
## zipcode98146   1.090584e+05  8.308933e+03  1.312540e+01
## zipcode98148   5.848791e+04  1.369842e+04  4.269700e+00
## zipcode98155   1.438938e+05  7.313315e+03  1.967560e+01
## zipcode98166   8.768517e+04  8.354396e+03  1.049570e+01
## zipcode98168   6.459817e+04  8.451818e+03  7.643100e+00
## zipcode98177   2.085875e+05  8.583858e+03  2.430000e+01
## zipcode98178   5.579866e+04  8.490433e+03  6.571900e+00
## zipcode98188   3.428638e+04  1.064399e+04  3.221200e+00
## zipcode98198   1.737989e+04  8.414422e+03  2.065500e+00
## zipcode98199   3.682742e+05  8.035034e+03  4.583360e+01
## sqft_living15  3.028260e+01  1.885200e+00  1.606350e+01
## year           4.514620e+04  4.806712e+03  9.392300e+00
## month02        4.387756e+03  4.448824e+03  9.863000e-01
## month03        2.182048e+04  4.089470e+03  5.335800e+00
## month04        2.632319e+04  3.988887e+03  6.599100e+00
## month05        3.546692e+04  5.276518e+03  6.721700e+00
## month06        4.218728e+04  6.245104e+03  6.755300e+00
## month07        3.948528e+04  6.242156e+03  6.325600e+00
## month08        3.809841e+04  6.300645e+03  6.046700e+00
## month09        3.882065e+04  6.343869e+03  6.119400e+00
## month10        3.695119e+04  6.321652e+03  5.845200e+00
## month11        3.635348e+04  6.477748e+03  5.612100e+00
## month12        3.669037e+04  6.435225e+03  5.701500e+00
## renovated      6.484695e+04  3.764299e+03  1.722680e+01
## age            4.621834e+02  4.222140e+01  1.094660e+01
## 
## Residual standard error: 77830 on 15034 degrees of freedom
par(mfrow=c(2,2))
plot(house_lm_huber)

hub1_train_pred = predict(house_lm_huber, newdata = train.dat)
hub1_test_pred = predict(house_lm_huber, newdata = test.dat)

hub1_train_results = postResample(pred = hub1_train_pred, obs = train.dat$price)
hub1_test_results = postResample(pred = hub1_test_pred, obs = test.dat$price)
hub1_test_sse = sum((hub1_test_pred - test.dat$price)^2)

# Append to the Results Data Frame
results.df = rbind(results.df,data.frame(model = "Robust Regression",
            R.Squared.Train = unname(hub1_train_results[2]),
            R.Squared.Test = unname(hub1_test_results[2]),
            RMSE.test = unname(hub1_test_results[1]),
            SSE.test = hub1_test_sse)
)
# Taking a quick look at the Huber weights
huber_weights <- data.frame(Observation = 1:nrow(train.dat), 
                            Residual = house_lm_huber$resid, 
                            Weight = house_lm_huber$w)

a <- huber_weights[order(house_lm_huber$w), ]

head10 <- head(a$Observation,10)
head10
##  [1] 13244  9986  5127  7807 11395  8814   643  7565  6125  4331
dim(train.dat)
## [1] 15129    17
# weight threshold for outliers
weight_threshold <- 0.1
outlier_indices <- which(house_lm_huber$w < weight_threshold)
train.dat.cleaned <- train.dat[-outlier_indices, ]

# c(7469,14108,455,14106,14105,14104)

house_lm_c <- lm(price ~ ., data = train.dat.cleaned)
summary(house_lm_c)
## 
## Call:
## lm(formula = price ~ ., data = train.dat.cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -905577  -62422   -1168   53858  997194 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.011e+08  1.404e+07  -7.201 6.28e-13 ***
## bedrooms      -1.435e+04  1.414e+03 -10.143  < 2e-16 ***
## bathrooms      1.841e+04  2.447e+03   7.523 5.66e-14 ***
## sqft_lot       3.054e-01  2.908e-02  10.504  < 2e-16 ***
## floors        -3.005e+04  2.980e+03 -10.084  < 2e-16 ***
## waterfront     4.513e+05  1.484e+04  30.417  < 2e-16 ***
## view           5.038e+04  1.680e+03  29.998  < 2e-16 ***
## condition      2.766e+04  1.807e+03  15.306  < 2e-16 ***
## grade          5.520e+04  1.706e+03  32.360  < 2e-16 ***
## sqft_above     1.579e+02  2.903e+00  54.391  < 2e-16 ***
## sqft_basement  9.385e+01  3.361e+00  27.929  < 2e-16 ***
## zipcode98002   2.333e+04  1.298e+04   1.797 0.072279 .  
## zipcode98003  -1.103e+04  1.196e+04  -0.922 0.356709    
## zipcode98004   7.205e+05  1.179e+04  61.111  < 2e-16 ***
## zipcode98005   3.086e+05  1.385e+04  22.278  < 2e-16 ***
## zipcode98006   2.653e+05  1.056e+04  25.122  < 2e-16 ***
## zipcode98007   2.457e+05  1.489e+04  16.505  < 2e-16 ***
## zipcode98008   2.492e+05  1.195e+04  20.856  < 2e-16 ***
## zipcode98010   7.360e+04  1.683e+04   4.373 1.23e-05 ***
## zipcode98011   1.275e+05  1.340e+04   9.518  < 2e-16 ***
## zipcode98014   1.252e+05  1.603e+04   7.809 6.15e-15 ***
## zipcode98019   9.267e+04  1.319e+04   7.028 2.18e-12 ***
## zipcode98022  -1.045e+04  1.271e+04  -0.822 0.411073    
## zipcode98023  -2.671e+04  1.029e+04  -2.595 0.009469 ** 
## zipcode98024   1.456e+05  1.881e+04   7.745 1.02e-14 ***
## zipcode98027   1.861e+05  1.074e+04  17.327  < 2e-16 ***
## zipcode98028   1.236e+05  1.208e+04  10.236  < 2e-16 ***
## zipcode98029   2.180e+05  1.142e+04  19.095  < 2e-16 ***
## zipcode98030   3.713e+03  1.214e+04   0.306 0.759674    
## zipcode98031   9.913e+03  1.184e+04   0.837 0.402613    
## zipcode98032  -2.208e+03  1.567e+04  -0.141 0.887919    
## zipcode98033   3.577e+05  1.074e+04  33.300  < 2e-16 ***
## zipcode98034   1.948e+05  1.014e+04  19.216  < 2e-16 ***
## zipcode98038   3.308e+04  9.862e+03   3.354 0.000799 ***
## zipcode98039   1.039e+06  2.401e+04  43.285  < 2e-16 ***
## zipcode98040   5.127e+05  1.221e+04  42.001  < 2e-16 ***
## zipcode98042   3.783e+03  1.003e+04   0.377 0.706116    
## zipcode98045   1.004e+05  1.271e+04   7.895 3.11e-15 ***
## zipcode98052   2.386e+05  9.909e+03  24.073  < 2e-16 ***
## zipcode98053   2.058e+05  1.099e+04  18.726  < 2e-16 ***
## zipcode98055   5.088e+04  1.194e+04   4.263 2.03e-05 ***
## zipcode98056   1.020e+05  1.066e+04   9.574  < 2e-16 ***
## zipcode98058   2.855e+04  1.049e+04   2.722 0.006501 ** 
## zipcode98059   8.741e+04  1.038e+04   8.418  < 2e-16 ***
## zipcode98065   9.837e+04  1.148e+04   8.568  < 2e-16 ***
## zipcode98070   2.227e+04  1.617e+04   1.377 0.168601    
## zipcode98072   1.580e+05  1.210e+04  13.054  < 2e-16 ***
## zipcode98074   1.889e+05  1.061e+04  17.810  < 2e-16 ***
## zipcode98075   1.833e+05  1.124e+04  16.313  < 2e-16 ***
## zipcode98077   1.308e+05  1.362e+04   9.605  < 2e-16 ***
## zipcode98092  -3.821e+04  1.100e+04  -3.474 0.000514 ***
## zipcode98102   4.669e+05  1.622e+04  28.780  < 2e-16 ***
## zipcode98103   3.413e+05  1.023e+04  33.355  < 2e-16 ***
## zipcode98105   4.560e+05  1.306e+04  34.926  < 2e-16 ***
## zipcode98106   1.402e+05  1.120e+04  12.520  < 2e-16 ***
## zipcode98107   3.327e+05  1.231e+04  27.024  < 2e-16 ***
## zipcode98108   1.399e+05  1.323e+04  10.578  < 2e-16 ***
## zipcode98109   4.864e+05  1.653e+04  29.430  < 2e-16 ***
## zipcode98112   6.137e+05  1.264e+04  48.562  < 2e-16 ***
## zipcode98115   3.373e+05  1.020e+04  33.054  < 2e-16 ***
## zipcode98116   3.009e+05  1.135e+04  26.505  < 2e-16 ***
## zipcode98117   3.286e+05  1.029e+04  31.921  < 2e-16 ***
## zipcode98118   1.758e+05  1.038e+04  16.941  < 2e-16 ***
## zipcode98119   4.690e+05  1.353e+04  34.673  < 2e-16 ***
## zipcode98122   3.293e+05  1.195e+04  27.567  < 2e-16 ***
## zipcode98125   2.103e+05  1.083e+04  19.424  < 2e-16 ***
## zipcode98126   2.041e+05  1.135e+04  17.979  < 2e-16 ***
## zipcode98133   1.717e+05  1.032e+04  16.635  < 2e-16 ***
## zipcode98136   2.599e+05  1.232e+04  21.088  < 2e-16 ***
## zipcode98144   2.722e+05  1.136e+04  23.957  < 2e-16 ***
## zipcode98146   1.155e+05  1.197e+04   9.652  < 2e-16 ***
## zipcode98148   6.815e+04  1.973e+04   3.455 0.000552 ***
## zipcode98155   1.470e+05  1.054e+04  13.947  < 2e-16 ***
## zipcode98166   7.963e+04  1.204e+04   6.615 3.85e-11 ***
## zipcode98168   7.664e+04  1.217e+04   6.296 3.13e-10 ***
## zipcode98177   2.285e+05  1.240e+04  18.422  < 2e-16 ***
## zipcode98178   5.994e+04  1.223e+04   4.902 9.59e-07 ***
## zipcode98188   3.556e+04  1.533e+04   2.320 0.020346 *  
## zipcode98198   1.242e+04  1.212e+04   1.025 0.305579    
## zipcode98199   3.853e+05  1.159e+04  33.252  < 2e-16 ***
## sqft_living15  3.390e+01  2.751e+00  12.326  < 2e-16 ***
## year           4.989e+04  6.966e+03   7.162 8.32e-13 ***
## month02        1.045e+03  6.417e+03   0.163 0.870612    
## month03        2.404e+04  5.900e+03   4.074 4.64e-05 ***
## month04        2.598e+04  5.755e+03   4.515 6.37e-06 ***
## month05        3.488e+04  7.639e+03   4.566 5.02e-06 ***
## month06        4.354e+04  9.034e+03   4.820 1.45e-06 ***
## month07        4.221e+04  9.028e+03   4.676 2.96e-06 ***
## month08        3.699e+04  9.115e+03   4.058 4.98e-05 ***
## month09        3.547e+04  9.178e+03   3.864 0.000112 ***
## month10        3.791e+04  9.144e+03   4.146 3.40e-05 ***
## month11        3.894e+04  9.369e+03   4.156 3.26e-05 ***
## month12        4.055e+04  9.308e+03   4.356 1.33e-05 ***
## renovated      7.129e+04  5.476e+03  13.017  < 2e-16 ***
## age            5.299e+02  6.111e+01   8.671  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 125500 on 14963 degrees of freedom
## Multiple R-squared:  0.8381, Adjusted R-squared:  0.8371 
## F-statistic: 823.9 on 94 and 14963 DF,  p-value: < 2.2e-16
bcc <- boxcox(house_lm_c, lambda=seq(-4, 4, by=0.1))

lambda_c <- bcc$x[which.max(bcc$y)]
cat("The optimal lambda is:", lambda_c, "\n")
## The optimal lambda is: 0.1012121
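
The Box-Cox family below puts this estimate in context: because the fitted lambda of roughly 0.10 lies close to 0, the indicated transformation is nearly a pure log, which supports modeling log(price).

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
  \log y, & \lambda = 0
\end{cases}
```
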
hub2_train_pred = predict(house_lm_c, newdata = train.dat.cleaned)
hub2_test_pred = predict(house_lm_c, newdata = test.dat)

hub2_train_results = postResample(pred = hub2_train_pred, obs = train.dat.cleaned$price)
hub2_test_results = postResample(pred = hub2_test_pred, obs = test.dat$price)
hub2_test_sse = sum((hub2_test_pred - test.dat$price)^2)

# Append to the Results Data Frame
results.df = rbind(results.df,data.frame(model = "Robust Regression (clean)",
            R.Squared.Train = unname(hub2_train_results[2]),
            R.Squared.Test = unname(hub2_test_results[2]),
            RMSE.test = unname(hub2_test_results[1]),
            SSE.test = hub2_test_sse)
)
house_lm_c1 <- lm(log(price) ~ ., data = train.dat.cleaned)
summary(house_lm_c1)
## 
## Call:
## lm(formula = log(price) ~ ., data = train.dat.cleaned)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.30270 -0.09474  0.01018  0.10276  0.93773 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.003e+02  2.024e+01  -9.893  < 2e-16 ***
## bedrooms       3.233e-03  2.040e-03   1.585 0.113021    
## bathrooms      3.698e-02  3.528e-03  10.479  < 2e-16 ***
## sqft_lot       7.325e-07  4.194e-08  17.467  < 2e-16 ***
## floors        -3.176e-02  4.298e-03  -7.388 1.56e-13 ***
## waterfront     4.510e-01  2.140e-02  21.080  < 2e-16 ***
## view           5.684e-02  2.422e-03  23.468  < 2e-16 ***
## condition      6.197e-02  2.607e-03  23.773  < 2e-16 ***
## grade          8.715e-02  2.460e-03  35.424  < 2e-16 ***
## sqft_above     2.028e-04  4.187e-06  48.436  < 2e-16 ***
## sqft_basement  1.300e-04  4.846e-06  26.834  < 2e-16 ***
## zipcode98002  -3.760e-02  1.871e-02  -2.009 0.044529 *  
## zipcode98003   2.406e-02  1.725e-02   1.395 0.163185    
## zipcode98004   1.073e+00  1.700e-02  63.128  < 2e-16 ***
## zipcode98005   7.105e-01  1.997e-02  35.572  < 2e-16 ***
## zipcode98006   6.000e-01  1.523e-02  39.398  < 2e-16 ***
## zipcode98007   6.406e-01  2.147e-02  29.842  < 2e-16 ***
## zipcode98008   6.426e-01  1.723e-02  37.286  < 2e-16 ***
## zipcode98010   2.459e-01  2.427e-02  10.133  < 2e-16 ***
## zipcode98011   4.466e-01  1.932e-02  23.111  < 2e-16 ***
## zipcode98014   3.257e-01  2.312e-02  14.088  < 2e-16 ***
## zipcode98019   3.333e-01  1.902e-02  17.529  < 2e-16 ***
## zipcode98022   3.089e-02  1.834e-02   1.684 0.092114 .  
## zipcode98023  -2.888e-02  1.484e-02  -1.946 0.051685 .  
## zipcode98024   3.901e-01  2.712e-02  14.384  < 2e-16 ***
## zipcode98027   5.191e-01  1.549e-02  33.505  < 2e-16 ***
## zipcode98028   4.194e-01  1.742e-02  24.075  < 2e-16 ***
## zipcode98029   5.982e-01  1.646e-02  36.336  < 2e-16 ***
## zipcode98030   4.409e-02  1.750e-02   2.519 0.011790 *  
## zipcode98031   6.812e-02  1.708e-02   3.988 6.69e-05 ***
## zipcode98032  -2.332e-02  2.259e-02  -1.032 0.302038    
## zipcode98033   7.622e-01  1.549e-02  49.199  < 2e-16 ***
## zipcode98034   5.331e-01  1.462e-02  36.460  < 2e-16 ***
## zipcode98038   1.683e-01  1.422e-02  11.832  < 2e-16 ***
## zipcode98039   1.233e+00  3.463e-02  35.617  < 2e-16 ***
## zipcode98040   8.370e-01  1.760e-02  47.548  < 2e-16 ***
## zipcode98042   5.081e-02  1.447e-02   3.512 0.000446 ***
## zipcode98045   3.319e-01  1.834e-02  18.102  < 2e-16 ***
## zipcode98052   6.332e-01  1.429e-02  44.309  < 2e-16 ***
## zipcode98053   5.645e-01  1.585e-02  35.624  < 2e-16 ***
## zipcode98055   1.303e-01  1.721e-02   7.569 3.98e-14 ***
## zipcode98056   3.069e-01  1.537e-02  19.976  < 2e-16 ***
## zipcode98058   1.484e-01  1.513e-02   9.814  < 2e-16 ***
## zipcode98059   3.211e-01  1.497e-02  21.442  < 2e-16 ***
## zipcode98065   3.843e-01  1.656e-02  23.208  < 2e-16 ***
## zipcode98070   3.123e-01  2.332e-02  13.387  < 2e-16 ***
## zipcode98072   4.870e-01  1.746e-02  27.902  < 2e-16 ***
## zipcode98074   5.519e-01  1.530e-02  36.076  < 2e-16 ***
## zipcode98075   5.311e-01  1.620e-02  32.780  < 2e-16 ***
## zipcode98077   4.301e-01  1.964e-02  21.894  < 2e-16 ***
## zipcode98092   1.368e-02  1.586e-02   0.862 0.388605    
## zipcode98102   9.428e-01  2.340e-02  40.299  < 2e-16 ***
## zipcode98103   8.027e-01  1.476e-02  54.400  < 2e-16 ***
## zipcode98105   9.343e-01  1.883e-02  49.625  < 2e-16 ***
## zipcode98106   2.954e-01  1.615e-02  18.290  < 2e-16 ***
## zipcode98107   8.136e-01  1.776e-02  45.825  < 2e-16 ***
## zipcode98108   3.681e-01  1.907e-02  19.300  < 2e-16 ***
## zipcode98109   9.724e-01  2.383e-02  40.801  < 2e-16 ***
## zipcode98112   1.028e+00  1.822e-02  56.423  < 2e-16 ***
## zipcode98115   8.071e-01  1.471e-02  54.848  < 2e-16 ***
## zipcode98116   7.393e-01  1.637e-02  45.150  < 2e-16 ***
## zipcode98117   7.980e-01  1.484e-02  53.761  < 2e-16 ***
## zipcode98118   4.378e-01  1.496e-02  29.253  < 2e-16 ***
## zipcode98119   9.728e-01  1.950e-02  49.873  < 2e-16 ***
## zipcode98122   7.836e-01  1.723e-02  45.485  < 2e-16 ***
## zipcode98125   5.668e-01  1.561e-02  36.305  < 2e-16 ***
## zipcode98126   5.260e-01  1.637e-02  32.137  < 2e-16 ***
## zipcode98133   4.501e-01  1.488e-02  30.236  < 2e-16 ***
## zipcode98136   6.679e-01  1.777e-02  37.583  < 2e-16 ***
## zipcode98144   6.386e-01  1.638e-02  38.981  < 2e-16 ***
## zipcode98146   2.608e-01  1.726e-02  15.113  < 2e-16 ***
## zipcode98148   1.377e-01  2.845e-02   4.841 1.31e-06 ***
## zipcode98155   4.142e-01  1.520e-02  27.251  < 2e-16 ***
## zipcode98166   2.913e-01  1.736e-02  16.780  < 2e-16 ***
## zipcode98168   7.663e-02  1.755e-02   4.366 1.28e-05 ***
## zipcode98177   5.906e-01  1.789e-02  33.021  < 2e-16 ***
## zipcode98178   1.359e-01  1.764e-02   7.709 1.35e-14 ***
## zipcode98188   7.757e-02  2.211e-02   3.509 0.000451 ***
## zipcode98198   6.829e-02  1.748e-02   3.908 9.36e-05 ***
## zipcode98199   8.377e-01  1.671e-02  50.132  < 2e-16 ***
## sqft_living15  8.737e-05  3.967e-06  22.026  < 2e-16 ***
## year           1.049e-01  1.005e-02  10.439  < 2e-16 ***
## month02        1.514e-02  9.254e-03   1.637 0.101755    
## month03        4.758e-02  8.508e-03   5.593 2.28e-08 ***
## month04        6.717e-02  8.299e-03   8.094 6.22e-16 ***
## month05        8.416e-02  1.102e-02   7.639 2.32e-14 ***
## month06        1.040e-01  1.303e-02   7.985 1.51e-15 ***
## month07        9.716e-02  1.302e-02   7.463 8.93e-14 ***
## month08        9.963e-02  1.315e-02   7.579 3.68e-14 ***
## month09        1.016e-01  1.324e-02   7.679 1.70e-14 ***
## month10        9.202e-02  1.319e-02   6.979 3.10e-12 ***
## month11        8.393e-02  1.351e-02   6.212 5.38e-10 ***
## month12        9.709e-02  1.342e-02   7.233 4.95e-13 ***
## renovated      1.002e-01  7.898e-03  12.691  < 2e-16 ***
## age            2.191e-04  8.813e-05   2.486 0.012921 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1809 on 14963 degrees of freedom
## Multiple R-squared:  0.8746, Adjusted R-squared:  0.8738 
## F-statistic:  1110 on 94 and 14963 DF,  p-value: < 2.2e-16
dim(train.dat.cleaned)
## [1] 15058    17
dim(train.dat)
## [1] 15129    17
par(mfrow = c(2,2))
plot(house_lm_c1)

# Looking at the residuals using the Cook's distance chart
ols_plot_cooksd_chart(house_lm_c1)

# Studentized Residuals vs Leverage Plot
ols_plot_resid_lev(house_lm_c1)

# Deleted Studentized Residuals vs Fitted Values Plot
ols_plot_resid_stud_fit(house_lm_c1)

hub3_train_pred = exp(predict(house_lm_c1, newdata = train.dat.cleaned))
hub3_test_pred = exp(predict(house_lm_c1, newdata = test.dat))

hub3_train_results = postResample(pred = hub3_train_pred, obs = train.dat.cleaned$price)
hub3_test_results = postResample(pred = hub3_test_pred, obs = test.dat$price)
hub3_test_sse = sum((hub3_test_pred - test.dat$price)^2)

# Append to the Results Data Frame
results.df = rbind(results.df,data.frame(model = "Robust Regression (transformed + clean)",
            R.Squared.Train = unname(hub3_train_results[2]),
            R.Squared.Test = unname(hub3_test_results[2]),
            RMSE.test = unname(hub3_test_results[1]),
            SSE.test = hub3_test_sse)
)
cbind(house_lm1$coefficients,house_lm_huber$coefficients)
##                        [,1]          [,2]
## (Intercept)   -7.984177e+01 -9.142345e+07
## bedrooms      -2.090699e-04 -1.036066e+04
## bathrooms      1.378570e-02  1.598913e+04
## sqft_lot       2.696722e-07  3.175002e-01
## floors        -1.337369e-02 -2.487430e+04
## waterfront     1.856756e-01  5.263568e+05
## view           2.400920e-02  4.553335e+04
## condition      2.332907e-02  2.504671e+04
## grade          3.443885e-02  4.749743e+04
## sqft_above     8.165243e-05  1.404457e+02
## sqft_basement  5.142162e-05  8.514710e+01
## zipcode98002  -1.045429e-02  1.648089e+04
## zipcode98003   6.794888e-03 -3.704417e+03
## zipcode98004   4.188720e-01  7.096101e+05
## zipcode98005   2.647800e-01  3.193243e+05
## zipcode98006   2.261090e-01  2.746299e+05
## zipcode98007   2.367305e-01  2.492407e+05
## zipcode98008   2.375359e-01  2.425199e+05
## zipcode98010   8.921900e-02  6.121594e+04
## zipcode98011   1.615184e-01  1.394875e+05
## zipcode98014   1.206333e-01  1.251285e+05
## zipcode98019   1.202903e-01  9.739583e+04
## zipcode98022   9.191790e-03 -6.307205e+03
## zipcode98023  -1.211335e-02 -1.709710e+04
## zipcode98024   1.430482e-01  1.418280e+05
## zipcode98027   1.905173e-01  1.952509e+05
## zipcode98028   1.517718e-01  1.304999e+05
## zipcode98029   2.201274e-01  2.216358e+05
## zipcode98030   1.498161e-02  4.833502e+03
## zipcode98031   2.342637e-02  1.124923e+04
## zipcode98032  -7.946236e-03 -2.500411e+03
## zipcode98033   2.877161e-01  3.255407e+05
## zipcode98034   1.992787e-01  1.918027e+05
## zipcode98038   5.937352e-02  3.741830e+04
## zipcode98039   4.913841e-01  1.088986e+06
## zipcode98040   3.180687e-01  4.825008e+05
## zipcode98042   1.734562e-02  6.782856e+03
## zipcode98045   1.200448e-01  1.035766e+05
## zipcode98052   2.336908e-01  2.481044e+05
## zipcode98053   2.075329e-01  2.237339e+05
## zipcode98055   4.812806e-02  4.541222e+04
## zipcode98056   1.120329e-01  1.008883e+05
## zipcode98058   5.225891e-02  3.128997e+04
## zipcode98059   1.154523e-01  9.100609e+04
## zipcode98065   1.379774e-01  1.107907e+05
## zipcode98070   1.075700e-01  6.602524e+04
## zipcode98072   1.775348e-01  1.668108e+05
## zipcode98074   2.021780e-01  2.055064e+05
## zipcode98075   1.945190e-01  2.013117e+05
## zipcode98077   1.557672e-01  1.547341e+05
## zipcode98092   1.132781e-03 -2.833064e+04
## zipcode98102   3.577684e-01  4.322050e+05
## zipcode98103   2.993586e-01  3.203349e+05
## zipcode98105   3.532138e-01  4.182605e+05
## zipcode98106   1.113564e-01  1.281162e+05
## zipcode98107   3.053378e-01  3.239683e+05
## zipcode98108   1.358576e-01  1.311999e+05
## zipcode98109   3.695117e-01  4.472080e+05
## zipcode98112   3.993351e-01  5.582963e+05
## zipcode98115   3.003424e-01  3.243484e+05
## zipcode98116   2.738547e-01  2.970097e+05
## zipcode98117   2.967191e-01  3.171497e+05
## zipcode98118   1.626849e-01  1.654538e+05
## zipcode98119   3.694593e-01  4.453525e+05
## zipcode98122   2.915634e-01  3.112322e+05
## zipcode98125   2.085315e-01  2.024326e+05
## zipcode98126   1.941673e-01  1.944648e+05
## zipcode98133   1.659214e-01  1.617212e+05
## zipcode98136   2.463284e-01  2.526864e+05
## zipcode98144   2.397501e-01  2.485439e+05
## zipcode98146   9.789696e-02  1.090584e+05
## zipcode98148   5.225158e-02  5.848791e+04
## zipcode98155   1.520304e-01  1.438938e+05
## zipcode98166   1.039807e-01  8.768517e+04
## zipcode98168   3.274365e-02  6.459817e+04
## zipcode98177   2.192247e-01  2.085875e+05
## zipcode98178   5.086101e-02  5.579866e+04
## zipcode98188   2.898355e-02  3.428638e+04
## zipcode98198   2.377353e-02  1.737989e+04
## zipcode98199   3.133203e-01  3.682742e+05
## sqft_living15  3.223715e-05  3.028262e+01
## year           4.110330e-02  4.514620e+04
## month02        5.538594e-03  4.387756e+03
## month03        1.840067e-02  2.182048e+04
## month04        2.503644e-02  2.632319e+04
## month05        3.277310e-02  3.546692e+04
## month06        4.042445e-02  4.218728e+04
## month07        3.761323e-02  3.948528e+04
## month08        3.866877e-02  3.809841e+04
## month09        3.933506e-02  3.882065e+04
## month10        3.560595e-02  3.695119e+04
## month11        3.352957e-02  3.635348e+04
## month12        3.767499e-02  3.669037e+04
## renovated      4.067475e-02  6.484695e+04
## age            1.216800e-04  4.621834e+02

Analysis: The Robust regression with bisquare weights has a lower residual standard error than the Huber fit. Note, however, that the algorithm did not converge within the default 20 iterations (see the warning below), so its estimates should be interpreted with some caution.
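
The contrast between the two fits is easiest to see from their weight functions. Writing u for a scaled residual, Huber weights shrink only in proportion to 1/|u|, so gross outliers always retain some weight, whereas bisquare weights vanish entirely beyond the cutoff (in MASS the default tuning constants are k = 1.345 and c = 4.685):

```latex
w_{\mathrm{Huber}}(u) = \min\!\left(1,\; \frac{k}{\lvert u\rvert}\right),
\qquad
w_{\mathrm{bisquare}}(u) =
\begin{cases}
  \left[1 - (u/c)^2\right]^2, & \lvert u\rvert \le c \\
  0, & \lvert u\rvert > c
\end{cases}
```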

# Robust Regression using bisquare weights 
house_lm_bisquare <- MASS::rlm(price ~ ., psi = psi.bisquare, data = train.dat)
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
summary(house_lm_bisquare)
## 
## Call: rlm(formula = price ~ ., data = train.dat, psi = psi.bisquare)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -994094  -43436    3968   53217 5520496 
## 
## Coefficients:
##               Value         Std. Error    t value      
## (Intercept)   -7.939504e+07  8.644357e+06 -9.184600e+00
## bedrooms      -3.828591e+03  8.702592e+02 -4.399400e+00
## bathrooms      1.414119e+04  1.506743e+03  9.385300e+00
## sqft_lot       3.728000e-01  1.800000e-02  2.070610e+01
## floors        -2.003598e+04  1.836584e+03 -1.090940e+01
## waterfront     7.090889e+05  8.555804e+03  8.287810e+01
## view           3.305773e+04  1.024882e+03  3.225520e+01
## condition      2.361213e+04  1.116942e+03  2.114000e+01
## grade          4.013147e+04  1.051093e+03  3.818070e+01
## sqft_above     1.136249e+02  1.757600e+00  6.464930e+01
## sqft_basement  6.552360e+01  2.047600e+00  3.199980e+01
## zipcode98002   1.038516e+04  8.042177e+03  1.291300e+00
## zipcode98003   1.064104e+03  7.413196e+03  1.435000e-01
## zipcode98004   5.455434e+05  7.214141e+03  7.562140e+01
## zipcode98005   3.284475e+05  8.581714e+03  3.827290e+01
## zipcode98006   2.794622e+05  6.516018e+03  4.288850e+01
## zipcode98007   2.512651e+05  9.224962e+03  2.723750e+01
## zipcode98008   2.441794e+05  7.395729e+03  3.301630e+01
## zipcode98010   5.559038e+04  1.042978e+04  5.330000e+00
## zipcode98011   1.501711e+05  8.304245e+03  1.808370e+01
## zipcode98014   1.274116e+05  9.935427e+03  1.282400e+01
## zipcode98019   1.010915e+05  8.171643e+03  1.237100e+01
## zipcode98022  -1.751092e+03  7.879132e+03 -2.222000e-01
## zipcode98023  -1.385163e+04  6.378788e+03 -2.171500e+00
## zipcode98024   1.388689e+05  1.165471e+04  1.191530e+01
## zipcode98027   2.046462e+05  6.656993e+03  3.074150e+01
## zipcode98028   1.369723e+05  7.485378e+03  1.829870e+01
## zipcode98029   2.307580e+05  7.074705e+03  3.261730e+01
## zipcode98030   5.606297e+03  7.522837e+03  7.452000e-01
## zipcode98031   1.178252e+04  7.339795e+03  1.605300e+00
## zipcode98032  -3.186326e+03  9.709864e+03 -3.282000e-01
## zipcode98033   3.028901e+05  6.636335e+03  4.564120e+01
## zipcode98034   1.880343e+05  6.265487e+03  3.001110e+01
## zipcode98038   4.003995e+04  6.111575e+03  6.551500e+00
## zipcode98039   1.161125e+06  1.429212e+04  8.124230e+01
## zipcode98040   4.414695e+05  7.521377e+03  5.869530e+01
## zipcode98042   7.755716e+03  6.217227e+03  1.247500e+00
## zipcode98045   1.079694e+05  7.879902e+03  1.370190e+01
## zipcode98052   2.601247e+05  6.140418e+03  4.236270e+01
## zipcode98053   2.380722e+05  6.807783e+03  3.497060e+01
## zipcode98055   4.433038e+04  7.397302e+03  5.992800e+00
## zipcode98056   9.966874e+04  6.603419e+03  1.509350e+01
## zipcode98058   3.461953e+04  6.500455e+03  5.325700e+00
## zipcode98059   9.413628e+04  6.434398e+03  1.463020e+01
## zipcode98065   1.250967e+05  7.114921e+03  1.758230e+01
## zipcode98070   1.020163e+05  9.998474e+03  1.020320e+01
## zipcode98072   1.765421e+05  7.500731e+03  2.353670e+01
## zipcode98074   2.217597e+05  6.567975e+03  3.376380e+01
## zipcode98075   2.238795e+05  6.946997e+03  3.222680e+01
## zipcode98077   1.762090e+05  8.439763e+03  2.087840e+01
## zipcode98092  -2.027686e+04  6.816217e+03 -2.974800e+00
## zipcode98102   3.924664e+05  9.925902e+03  3.953960e+01
## zipcode98103   3.098668e+05  6.338517e+03  4.888630e+01
## zipcode98105   3.767537e+05  8.016207e+03  4.699900e+01
## zipcode98106   1.200166e+05  6.940594e+03  1.729200e+01
## zipcode98107   3.188822e+05  7.604939e+03  4.193090e+01
## zipcode98108   1.287908e+05  8.195875e+03  1.571410e+01
## zipcode98109   4.166551e+05  1.014658e+04  4.106360e+01
## zipcode98112   4.486249e+05  7.741226e+03  5.795270e+01
## zipcode98115   3.176453e+05  6.321733e+03  5.024660e+01
## zipcode98116   2.976759e+05  7.034019e+03  4.231950e+01
## zipcode98117   3.124191e+05  6.377377e+03  4.898870e+01
## zipcode98118   1.578517e+05  6.426202e+03  2.456380e+01
## zipcode98119   4.112068e+05  8.316392e+03  4.944530e+01
## zipcode98122   3.054274e+05  7.401311e+03  4.126670e+01
## zipcode98125   2.000089e+05  6.708031e+03  2.981630e+01
## zipcode98126   1.930074e+05  7.033026e+03  2.744300e+01
## zipcode98133   1.562088e+05  6.396260e+03  2.442190e+01
## zipcode98136   2.532552e+05  7.634826e+03  3.317110e+01
## zipcode98144   2.346370e+05  7.013105e+03  3.345690e+01
## zipcode98146   1.045130e+05  7.415647e+03  1.409360e+01
## zipcode98148   5.178334e+04  1.222571e+04  4.235600e+00
## zipcode98155   1.420096e+05  6.527067e+03  2.175700e+01
## zipcode98166   9.472005e+04  7.456222e+03  1.270350e+01
## zipcode98168   5.465524e+04  7.543170e+03  7.245700e+00
## zipcode98177   2.050970e+05  7.661015e+03  2.677150e+01
## zipcode98178   5.577549e+04  7.577634e+03  7.360500e+00
## zipcode98188   3.133601e+04  9.499668e+03  3.298600e+00
## zipcode98198   2.306371e+04  7.509795e+03  3.071100e+00
## zipcode98199   3.576277e+05  7.171195e+03  4.987000e+01
## sqft_living15  3.636930e+01  1.682500e+00  2.161610e+01
## year           3.921315e+04  4.289947e+03  9.140700e+00
## month02        4.946723e+03  3.970535e+03  1.245900e+00
## month03        2.131728e+04  3.649815e+03  5.840600e+00
## month04        2.693831e+04  3.560046e+03  7.566800e+00
## month05        3.242663e+04  4.709244e+03  6.885700e+00
## month06        4.032933e+04  5.573698e+03  7.235700e+00
## month07        3.609722e+04  5.571068e+03  6.479400e+00
## month08        3.542068e+04  5.623269e+03  6.298900e+00
## month09        3.631760e+04  5.661845e+03  6.414400e+00
## month10        3.302623e+04  5.642017e+03  5.853600e+00
## month11        3.224581e+04  5.781331e+03  5.577600e+00
## month12        3.267243e+04  5.743380e+03  5.688700e+00
## renovated      5.019922e+04  3.359603e+03  1.494200e+01
## age            3.557924e+02  3.768230e+01  9.441900e+00
## 
## Residual standard error: 71780 on 15034 degrees of freedom
hub4_train_pred = predict(house_lm_c1, newdata = train.dat)
hub4_test_pred = predict(house_lm_c1, newdata = test.dat)

hub4_train_results = postResample(pred = hub4_train_pred, obs = train.dat$price)
hub4_test_results = postResample(pred = hub4_test_pred, obs = test.dat$price)
hub4_test_sse = sum((hub4_test_pred - test.dat$price)^2)

# Append to the Results Data Frame
results.df = rbind(results.df,data.frame(model = "Robust Regression (transformed + clean)",
            R.Squared.Train = unname(hub4_train_results[2]),
            R.Squared.Test = unname(hub4_test_results[2]),
            RMSE.test = unname(hub4_test_results[1]),
            SSE.test = hub4_test_sse)
)

Analysis: The linear and stepwise models are the most effective at predicting price on the test data; both have r-squared values of 0.77. The Lasso regression model performs the best of the remedial models, producing an r-squared value of 0.69, with ElasticNet and Ridge regression performing nearly as well. Robust regression with the Huber weights has an r-squared value of 0.65, not far off from the top three models. Robust regression with the bisquare weights is where we see a dip in performance, with an r-squared value of 0.59. The Weighted Least Squares regression performs the worst, with an r-squared value of 0.41.

# Evaluating the performance of the models using the test data set

price.predictors <- colnames(dplyr::select(test.dat.o, -price))

# Predictions for each model using the test dataset
predictions <- data.frame(
  price = test.dat$price,
  price.lm = predict(house_lm1, test.dat)^(1/lambda),
  price.sw.lm = predict(sw_model, test.dat)^(1/lambda),
  price.wls = predict(house_lm_wls, test.dat),
  price.ridge = predict(house_lm_ridge, s = best.lambda.ridge, 
                        newx = data.matrix(test.dat.o[price.predictors])),
  price.lasso = predict(house_lm_lasso, s = best.lambda.lasso, 
                        newx = data.matrix(test.dat.o[price.predictors])),
  price.en = predict(house_lm_enet, s = best.lambda.enet, 
                     newx = data.matrix(test.dat.o[price.predictors])),
  price.huber = predict(house_lm_huber, test.dat),
  price.bisquare = predict(house_lm_bisquare, test.dat)
)

# Function to calculate SSE, R2, MSE, and RMSE
calc_metrics <- function(actual, predicted) {
  sse <- sum((actual - predicted) ^ 2)
  mse <- sse / length(actual)
  rmse <- sqrt(mse) # Calculate RMSE
  sst <- sum((actual - mean(actual)) ^ 2)
  r2 <- 1 - sse / sst
  return(c(SST = sst, SSE = sse, MSE = mse, RMSE = rmse, R2 = r2))
}

# Apply calc_metrics to each set of predictions
metrics <- data.frame(
  Model = c("Linear after BoxCox", "Stepwise", "Weighted Least Squares", 
            "Ridge", "Lasso", "Elastic Net", "Huber", "Bisquare"),
  do.call(
    rbind, lapply(2:ncol(predictions), 
                  function(i) calc_metrics(predictions$price, predictions[,i])))
)

# Display the metrics table with RMSE
metrics %>%
  dplyr::arrange(desc(R2)) %>%
  knitr::kable(caption = "SST, SSE, MSE, RMSE, and R2 of the Different Models")
SST, SSE, MSE, RMSE, and R2 of the Different Models

Model                    SST           SSE           MSE          RMSE      R2
Stepwise                 9.339154e+14  1.601132e+14  24693585535  157141.9  0.8285571
Linear after BoxCox      9.339154e+14  1.603782e+14  24734461027  157271.9  0.8282733
Huber                    9.339154e+14  2.024000e+14  31215293967  176678.5  0.7832781
Bisquare                 9.339154e+14  2.366529e+14  36497976464  191044.4  0.7466014
Lasso                    9.339154e+14  2.815097e+14  43416055977  208365.2  0.6985704
Elastic Net              9.339154e+14  2.815163e+14  43417071476  208367.6  0.6985634
Ridge                    9.339154e+14  2.849536e+14  43947191572  209635.9  0.6948829
Weighted Least Squares   9.339154e+14  3.425053e+14  52823147597  229832.9  0.6332588
# This is not working - commenting out temporarily to improve knit times
#k1<-ols_step_best_subset(house_lm1)
#k1
#plot(k1, guide="none")

V. Challenger Models

In this section, we explore alternative predictive models to challenge our primary regression model. This is crucial for ensuring our final model’s robustness by comparing it against these ‘challenger’ models. Here, we experiment with different modeling techniques such as regression trees, neural networks, or SVMs, evaluating their performance and applicability. This comparative analysis helps in understanding the strengths and weaknesses of various approaches, guiding us to select the most effective model for predicting real estate prices in King County.

Regression Tree Models

This block is designed to determine the optimal depth for a regression tree model predicting house prices. It iterates over a predefined set of depth values, constructing a model at each depth, and calculating MSE and R-squared values for both the test and training datasets. The loop stores these metrics for each depth, and upon completion, it identifies the depth that yields the highest R-squared value on the test data, suggesting the best generalization performance.

set.seed(1023)
n<-dim(df_house_original)[1]
IND<-sample(c(1:n),round(n*0.7))
train.dat.tree <-df_house_original[IND,]
test.dat.tree <-df_house_original[-c(IND),]
depth_values <- c(2:20)

mse_values = numeric(length(depth_values))
test_rsq_values = numeric(length(depth_values))
train_rsq_values = numeric(length(depth_values))

for (i in seq_along(depth_values)) {
  depth = depth_values[i]

  # Note: depth is tuned on the train.dat/test.dat split; the
  # tree-specific split created above is used for the final fit below
  regression_tree_model = rpart(price ~ ., 
                                data = train.dat, 
                                method = "anova", 
                                control = rpart.control(maxdepth = depth))
  predictions_test <- predict(regression_tree_model, newdata = test.dat)
  predictions_train <- predict(regression_tree_model, newdata = train.dat)

  mse_values[i] <- mean((predictions_test - test.dat$price)^2)
  test_rsq_values[i] = cor(predictions_test,test.dat$price)^2
  train_rsq_values[i] = cor(predictions_train,train.dat$price)^2
}

# Find the maximum R-squared value
max_rsq <- max(test_rsq_values)
# Get the corresponding depth value
best_depth <- depth_values[which.max(test_rsq_values)]

cat("Best depth:", best_depth, "with R-squared:", max_rsq)
## Best depth: 5 with R-squared: 0.6890352

Optimized Regression Tree Visualization

This subsection presents the visual depiction of the regression tree model at its calculated optimal complexity. By tuning the tree to the ‘best_depth’, we ensure the model is neither overfitting nor underfitting. The rpart.plot is customized to enhance interpretability, detailing split labels and the count of observations at each node, thereby providing a clear, scaled-up graphical representation of the decision-making process within the model.

Now we use the rpart.plot function to create a detailed plot of the tree, incorporating the optimal tree depth previously determined (best_depth). The plot is configured to show a type 4 tree, which includes split labels, and extra = 1 to display the number of observations in each node. The main title of the plot reflects the chosen depth for easy reference.

# Fitting decision tree with best depth
dtm = rpart(price ~ ., 
              data = train.dat.tree, 
              method = "anova",
              control=rpart.control(maxdepth=best_depth))

# Plot the tree
rpart.plot(dtm, main=paste("Regression Tree - Depth: ", best_depth), 
           type = 4, extra = 1, tweak=1.6)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting

Analysis: The regression tree plot depicts a model of depth 5 (best_depth), indicating a sequence of decisions starting from the root node using features such as ‘grade’, ‘lat’, ‘sqft_above’, and others. Each node represents a condition that splits the data, leading to more homogenous subsets. The leaf nodes represent the predicted price, with the number in each leaf node showing the average price for the observations that follow the path to that node. The tree structure suggests that ‘grade’, ‘sqft_living15’, location (‘waterfront’, ‘lat’ & ‘long’), and ‘sqft_above’ are important predictors, as they appear near the top and on many different levels of the tree, indicating their significant role in splitting the data and hence in predicting house prices.

Assessing Regression Tree Model Complexity with RMSE

Analysis of the regression tree model’s accuracy across varying depths, utilizing Root Mean Squared Error (RMSE) as the key performance metric.

# Data frame for plotting RMSE results (take the square root of the stored MSEs)
plot_data <- data.frame(Depth = depth_values, RMSE = sqrt(mse_values))

# Plot RMSE results vs Depth of the Tree
ggplot(plot_data, aes(x=Depth, y=RMSE)) +
  geom_point(color = "#6baed6") +
  geom_line(color = "#6baed6") +
  labs(x = "Depth Value", y = "Root Mean Squared Error", 
       title = "Regression Tree Cross Validation (RMSE vs depth)") +
  theme_minimal()

Analysis: The plot illustrates how RMSE changes as the depth of the regression tree increases. Initially, RMSE decreases significantly, suggesting improvements in model accuracy with added complexity. However, from a depth of 6 and beyond, the reduction in RMSE plateaus, indicating that increasing the tree depth further provides diminishing returns in terms of model accuracy. This visualization aids in selecting an appropriate model complexity to balance between underfitting and overfitting.

Regression Tree Model Fit Evaluation with R-squared

Influence of tree depth on the regression tree model’s explanatory power, as indicated by the R-squared values.

# Data frame for plotting R-squared results
plot_data <- data.frame(Depth=depth_values, TestR2=test_rsq_values)

# Plot R-squared vs Depth of the Tree
ggplot(plot_data, aes(x=Depth, y=TestR2)) +
  geom_point(color="#6baed6") +
  geom_line(color="#6baed6") +
  geom_text(aes(label=round(TestR2, 4)), vjust=-0.5, color="black") +
  labs(x = "Depth Value", y = "Test R-squared", 
       title = "Regression Tree Cross Validation (R2 vs depth)") +
  theme_minimal() 

Analysis: The plot illustrates the R-squared value at varying tree depths, showing a trend of increasing explanatory power as the depth increases from 2 to 5. The leveling off of R-squared values beyond a depth of 6 suggests that additional depth does not substantially improve the model’s ability to explain the variance in the data, indicating an optimal depth for model complexity.

Prioritizing Features in the Regression Tree Model

This subsection investigates the variable importance derived from the regression tree model, offering a clear depiction of which factors most heavily influence house pricing predictions. The measure of importance is calculated based on the reduction of prediction error attributed to each variable, providing insight into their relative predictive power within the model. This analysis is critical for understanding the key drivers of real estate values and can inform strategic decisions regarding property investments and market evaluations.

imp = dtm$variable.importance
dt_test_pred <- predict(dtm, newdata=test.dat.tree)
dt_train_pred <- predict(dtm, newdata=train.dat.tree)
dt_test_results = postResample(pred = dt_test_pred, obs = test.dat.tree$price)
dt_train_results = postResample(pred = dt_train_pred, obs = train.dat.tree$price)
dt_test_sse = sum((dt_test_pred - test.dat.tree$price)^2)

# Variable importance may be more reliable when averaged
# over cross-validation folds
blue_gradient <- colorRampPalette(c("#6baed6", "white"))(length(imp))
barplot(imp, las=2, main="Variable Importance in Decision Tree", 
        col=blue_gradient, cex.names=0.8, cex.axis=0.7, cex.lab=0.7)

Analysis: The bar plot indicates that ‘grade’ has the most significant impact on the model’s decisions, followed by ‘sqft_living15’, and ‘sqft_above’. Variables such as ‘age’, ‘floors’, and ‘renovated’ have minimal impact. This suggests that house quality and living area are the primary drivers of price according to the model, which aligns with real-world expectations of property valuation.

# Append results
results.df = rbind(results.df,data.frame(model = "Decision Tree Regression",
                            R.Squared.Train = unname(dt_train_results[2]),
                            R.Squared.Test = unname(dt_test_results[2]),
                            RMSE.test = unname(dt_test_results[1]),
                            SSE.test = dt_test_sse))

Random Forest Model

The Random Forest model is a powerful ensemble learning method used for both classification and regression tasks. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random Forests correct for decision trees’ habit of overfitting to their training set by introducing randomness in the tree-building process. This randomness can come from building trees on different samples of the data (bootstrap aggregating or bagging) or considering a random subset of features for splitting at each node. The model is robust against overfitting, can handle large datasets with higher dimensionality, and can estimate the importance of variables used in the classification or regression.
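The bagging mechanism described above can be sketched by hand. A minimal illustration of bootstrap aggregating with rpart trees (our toy helper, not the tuned model built below):

```r
# Sketch: bootstrap aggregating (bagging) by hand.
# Each tree is fit to a bootstrap resample of the training data;
# the ensemble prediction is the average over the trees.
library(rpart)

bagged_predict <- function(formula, train, newdata, n_trees = 25) {
  preds <- sapply(seq_len(n_trees), function(i) {
    boot_idx <- sample(nrow(train), replace = TRUE)   # bootstrap sample
    tree <- rpart(formula, data = train[boot_idx, ], method = "anova")
    predict(tree, newdata = newdata)
  })
  rowMeans(preds)   # average the per-tree predictions
}
```

A random forest goes one step further than plain bagging: at each split it also considers only a random subset of mtry candidate variables, which decorrelates the trees and further reduces variance.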

hyperparameter_grid <- expand.grid(
  ntree = c(200, 400, 550),  # Different values for ntree
  minsplit = c(4, 6, 8)  # Different values for minsplit
)
#Skip model tuning for computation time
skip_grid = TRUE
if (skip_grid==TRUE){
best_rf = randomForest(price ~ .
-year-renovated-floors-month-day-condition-bedrooms-sqft_basement-bathrooms-sqft_lot15-age-yr_renovated, 
              data = train.dat.tree,
              ntree = 200,
              mtry=7,
              minsplit=4,
              importance=TRUE)

summary(best_rf)

} else{
  
# Initialize model and its RMSE
best_rf <- NULL
best_rmse <- Inf

# initialize scores (RMSE, R-square) 
rmse_values_rf = numeric(nrow(hyperparameter_grid))
test_rsq_values_rf = numeric(nrow(hyperparameter_grid))
train_rsq_values_rf = numeric(nrow(hyperparameter_grid))

# Perform Grid Search to tune ntree (number of trees) and minsplit (minimum value in node to split)
for (i in 1:nrow(hyperparameter_grid)) {
  current_model <- randomForest(
    formula = price ~. -year-renovated-floors-month-day-condition-bedrooms-sqft_basement-bathrooms-sqft_lot15-age-yr_renovated,
    data = train.dat.tree,
    ntree = hyperparameter_grid$ntree[i],
    minsplit = hyperparameter_grid$minsplit[i]
  )

  # Make predictions
  predictions_test <- predict(current_model, test.dat.tree)
  predictions_train <- predict(current_model, train.dat.tree)
  
  # store results
  train_results = postResample(pred = predictions_train, obs = train.dat.tree$price)
  test_results = postResample(pred = predictions_test, obs = test.dat.tree$price)
  
  # RMSE
  current_rmse <- sqrt(mean((predictions_test - test.dat.tree$price)^2))
  
  rmse_values_rf[i] <- current_rmse
  
  # Rsquared
  test_rsq_values_rf[i] = unname(test_results[2])
  train_rsq_values_rf[i] = unname(train_results[2])


  # Check if current model is better than the best model so far
  if (current_rmse < best_rmse) {
    best_rmse <- current_rmse
    best_rf <- current_model
    best_minsplit = hyperparameter_grid$minsplit[i]
  }
}
# Cross validate to tune mtry (number of possible random variables per split)
ctrl = trainControl(method = "cv",  # Cross-validation method
                     number = 5,    # Number of folds
                     verboseIter = TRUE,  # Display iteration progress
                     summaryFunction = defaultSummary)  # For regression

param_grid = expand.grid(mtry=c(8,10,12))

# Finding model
best_rf = train(price ~ . -year-renovated-floors-yr_renovated-month-day-condition-bedrooms-sqft_basement-bathrooms-sqft_lot15-age, 
                  data = train.dat.tree,
                  method = "rf",
                  trControl = ctrl,
                  ntree=200,
                  maxdepth=5,
                  minsplit=4,
                  tuneGrid = param_grid)
}
## 
## Call:
##  randomForest(formula = price ~ . - year - renovated - floors -      month - day - condition - bedrooms - sqft_basement - bathrooms -      sqft_lot15 - age - yr_renovated, data = train.dat.tree, ntree = 200,      mtry = 7, minsplit = 4, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 200
## No. of variables tried at each split: 7
## 
##           Mean of squared residuals: 16301917289
##                     % Var explained: 87.54
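The error-versus-trees curve analyzed below is produced by randomForest's plot method; the plotting call does not appear in the chunk above, but would look like this:

```r
# Plot the out-of-bag mean squared error against the number of trees
# grown; the curve flattens once additional trees stop reducing error.
plot(best_rf, main = "Random Forest: Error vs. Number of Trees")
```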

Analysis: This plot depicts the mean squared error (MSE) across the number of trees in the model. The MSE sharply decreases as more trees are added, especially evident in the initial phase with fewer trees. As the number of trees reaches around 50, the decrease in MSE begins to plateau, indicating that adding more trees has diminishing returns on model performance. The model, consisting of 200 trees, explains approximately 87.54% of the variance, showcasing a high level of model accuracy. This suggests that the Random Forest model has a strong predictive capability for the housing price data, with a substantial portion of the variability in the response variable accounted for by the predictors used in the model.

Variable Significance in Random Forest Modeling

This subsection delves into the variable importance generated by the Random Forest model, providing insight into which predictors most significantly impact the target variable, house price. The analysis illustrates the relative influence of each variable through two metrics: the mean decrease in accuracy and the mean decrease in node purity.
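The first of these measures, %IncMSE, is a permutation importance: the model's error is re-measured after randomly shuffling one predictor, and the rise in MSE is attributed to that variable. A minimal sketch of the idea on a toy model (the helper name and example are ours, not the report's):

```r
# Sketch: permutation importance for a single variable.
# Shuffling a predictor breaks its link to the response; the resulting
# increase in MSE measures how much the model relied on that variable.
perm_importance <- function(fit, data, response, var) {
  base_mse <- mean((predict(fit, data) - data[[response]])^2)
  shuffled <- data
  shuffled[[var]] <- sample(shuffled[[var]])   # permute the column
  perm_mse <- mean((predict(fit, shuffled) - data[[response]])^2)
  100 * (perm_mse - base_mse) / base_mse       # % increase in MSE
}

# Example on a toy linear model:
fit <- lm(mpg ~ wt + hp, data = mtcars)
perm_importance(fit, mtcars, "mpg", "wt")
```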

# Create a Dataframe with Random Forest variable importance
rf_importance <- as.data.frame(importance(best_rf))

# Create a tidy data frame for ggplot
rf_importance$Variable <- rownames(rf_importance)
rf_importance <- rf_importance %>%
  tidyr::gather(Measure, Value, -Variable)

# Plot RF variable importance
ggplot(rf_importance, aes(x = reorder(Variable, Value), y = Value)) +
  geom_col(fill="#6baed6") +
  facet_wrap(~Measure, scales = "free") +
  coord_flip() +
  theme_minimal() +
  labs(x = NULL, y = NULL, title = "Variable Importance in Random Forest Model") +
  theme(legend.position = "none")

Analysis: This plot provides a clear visualization of the features that have the most substantial impact on predicting house prices. The length of the bars represents the importance of each variable, with ‘grade’ and ‘lat’ being the most significant predictors according to both measures used: the percentage increase in Mean Squared Error (%IncMSE) and Increase in Node Purity (IncNodePurity). This visualization is essential to understand which variables contribute most to the model’s predictive power and potentially guide feature selection.

# Random Forest Test Results
rf_train_pred = predict(best_rf, newdata = train.dat.tree)
rf_test_pred = predict(best_rf, newdata = test.dat.tree)

rf_train_results = postResample(pred = rf_train_pred, obs = train.dat.tree$price)
rf_test_results = postResample(pred = rf_test_pred, obs = test.dat.tree$price)
rf_test_sse = sum((rf_test_pred - test.dat.tree$price)^2)

# Append to the Results Data Frame
results.df = rbind(results.df,data.frame(model = "Random Forest Tuned Model",
            R.Squared.Train = unname(rf_train_results[2]),
            R.Squared.Test = unname(rf_test_results[2]),
            RMSE.test = unname(rf_test_results[1]),
            SSE.test = rf_test_sse)
)

Support Vector Machine (SVM) Model

SVM Model Construction

This subsection discusses the creation of the Support Vector Machine model. It entails the training phase on the dataset, where the SVM algorithm learns to find the best hyperplane that categorizes the data points.

# Build the SVM model
svm_model <- svm(price ~ ., data = train.dat)
print(summary(svm_model))
## 
## Call:
## svm(formula = price ~ ., data = train.dat)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.01052632 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  8628

Analysis: The SVM model summary indicates it’s an epsilon-type regression with a radial basis function kernel. The cost parameter is set to 1, which controls the trade-off between allowing training errors and forcing rigid margins. Gamma, set at approximately 0.0105, defines the influence of a single training example, with lower values meaning ‘far’ and higher values meaning ‘close’. The epsilon of 0.1 specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value. The model has a large number of support vectors, amounting to 8628, which could imply a complex model that may be at risk of overfitting.

SVM Model Evaluation

In this part, we assess the performance of the SVM model using the test dataset. The Root Mean Square Error (RMSE) metric provides insight into the average deviation of the predicted house prices from the actual values.

# Prediction and Performance
svm_predictions <- predict(svm_model, test.dat)
svm_rmse <- sqrt(mean((svm_predictions - test.dat$price)^2))
print(paste("SVM RMSE:", svm_rmse))
## [1] "SVM RMSE: 165183.993174409"

Analysis: The reported RMSE value for the SVM model is approximately $165,184, which suggests that on average, the model’s price predictions deviate from the actual sale prices by this amount. Given the high value, this may indicate that the model is not predicting the prices with a high degree of accuracy. In the context of house prices, such a large RMSE could be considered substantial, and it suggests that model parameters may need tuning or the model itself may require a more in-depth evaluation to identify areas of improvement.
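A tuning pass of the kind suggested above could use e1071's built-in cross-validated grid search. A sketch on a small built-in dataset (the grid values are illustrative; on train.dat the same call would be far more expensive):

```r
# Sketch: cross-validated grid search over cost and gamma with
# tune.svm (10-fold CV by default). The same call pattern applies
# to price ~ . on train.dat.
library(e1071)
set.seed(1023)

tuned <- tune.svm(mpg ~ ., data = mtcars,
                  cost = c(1, 10), gamma = c(0.01, 0.1))
tuned$best.parameters    # cost/gamma pair with the lowest CV error
best_svm <- tuned$best.model
```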

SVM Model Visualization

This subsection is dedicated to the visual exploration of the Support Vector Machine model’s predictive performance. Through graphical representations such as scatter plots of predicted versus actual values and histograms of prediction errors, we can intuitively assess the accuracy and distribution of the model’s predictions, and identify patterns or anomalies in the data. These visual analyses are critical for conveying complex statistical results in a clear and actionable format.

# SVM Scatter Plot of Actual vs. Predicted Prices
ggplot(data = data.frame(Actual = test.dat$price, Predicted = svm_predictions), aes(x = Actual, y = Predicted)) +
  geom_point(color = "#6baed6") +
  geom_abline(intercept = 0, slope = 1, color = "black") +
  labs(title = "SVM Actual vs. Predicted Prices", x = "Actual Prices", y = "Predicted Prices") +
  theme_minimal()

Analysis: The SVM Actual vs. Predicted Prices plot shows a positive correlation between the actual and predicted values, yet there is noticeable deviation from the line of perfect fit, especially at higher price points. This deviation contributes to the model’s overall RMSE.

# Create a data frame for the SVM errors
svm_errors <- test.dat$price - svm_predictions
svm_errors_df <- data.frame(Errors = svm_errors)

# Plot Histogram of Prediction Errors
ggplot(svm_errors_df, aes(x=Errors)) +
  geom_histogram(bins = 15, fill="#6baed6", color="black") + # fixed at 15 bins
  labs(title="SVM Prediction Error Distribution", x="Prediction Error", y="Count") +
  theme_minimal()

Analysis: The SVM Prediction Error Distribution histogram reveals that most prediction errors cluster around a small range, indicating a concentration of errors close to zero. However, the presence of errors across the scale shows that the model has varying degrees of accuracy for different price levels.

# SVM Plot of Residuals vs. Fitted Values
ggplot(data = data.frame(Predicted = svm_predictions, Residuals = svm_errors), aes(x = Predicted, y = Residuals)) +
  geom_point(color = "#6baed6") +
  geom_hline(yintercept = 0, color = "black") +
  labs(title = "SVM Residuals vs. Predicted", x = "Predicted Prices", y = "Residuals") +
  theme_minimal()

Analysis: The SVM Residuals vs. Predicted plot shows that residuals are not randomly dispersed around the zero line, particularly for higher-priced houses where the model tends to underestimate prices, evident from the residuals’ pattern. This suggests that the model might not capture all the nuances in the data, particularly for properties with higher actual prices.

Neural Network Model

Data Normalization for Neural Network

Preparing the dataset for neural network analysis by normalizing the data. The normalize function adjusts each feature to a common scale, eliminating discrepancies due to different units or scales.

# Data must be normalized for the Neural Network
normalize <- function(x) {
  # Only normalize if x is numeric
  if (is.numeric(x)) {
    return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)))
  } else {
    return(x)
  }
}

# Identify numeric columns
numeric_cols <- sapply(train.dat, is.numeric)

# Apply normalization only to numeric columns
train.dat.norm <- as.data.frame(lapply(train.dat[, numeric_cols], normalize))
test.dat.norm <- as.data.frame(lapply(test.dat[, numeric_cols], normalize))

Neural Network Construction and Architecture

Here, we construct the neural network model using the normalized data and visualize its structure. The neuralnet package is utilized to build the model with a specified architecture and then plot it to understand its configuration.

house_NN <- neuralnet(price ~ ., data = train.dat.norm, hidden = c(2,2), linear.output = TRUE)
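The network diagram interpreted below is drawn with neuralnet's plot method; the call does not appear in the chunk above, but would be:

```r
# Draw the fitted network: input nodes, the two 2-neuron hidden
# layers, and the trained connection weights along each edge.
# rep = "best" plots the repetition with the smallest error.
plot(house_NN, rep = "best")
```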

Analysis: The neural network diagram represents a model with inputs corresponding to features of the houses such as ‘bedrooms’, ‘bathrooms’, ‘sqft_living’, etc. The two hidden layers with two neurons each suggest an attempt to capture non-linear complexities in the data. The weights, denoted by numbers along the connections, indicate how each input is considered in predicting the house price. Significant weights suggest features that more strongly influence price predictions. Conversely, smaller weights might indicate less influence. The final output is the ‘price’, representing the model’s prediction based on the learned weights through the network’s training.

Performance Analysis of Neural Network

This final section evaluates the neural network’s predictive performance. It involves computing predictions, calculating key performance metrics like RMSE, R-squared, and SSE, and visualizing prediction accuracy and error distribution.

# Now run the compute function on the training covariates
# (the first column, price, is the response and must be dropped)
model_results <- compute(house_NN, train.dat.norm[-1])
predicted_y <- model_results$net.result
unnormalize <- function(x) {return(x * (max(df_house$price) - min(df_house$price)) + min(df_house$price))}

pred_new_train <- unnormalize(predicted_y)
# Compare back-transformed predictions with prices on the original scale
NN_train_performance <- data.frame(obs = train.dat$price, pred = pred_new_train)
round(defaultSummary(NN_train_performance), 3)
##       RMSE   Rsquared        MAE 
## 564244.622      0.743 468270.089
#performance on test data set
model_results <- compute(house_NN, test.dat.norm[-1])
predicted_y <- model_results$net.result

#transforming back to original scale
pred_new_test <- unnormalize(predicted_y)

NN_test_performance <- data.frame(obs = test.dat$price, pred = pred_new_test)
round(defaultSummary(NN_test_performance),3)
##       RMSE   Rsquared        MAE 
## 537381.270      0.716 457878.780
rmse_value <- rmse(pred_new_test, test.dat$price)
r_squared_value <- r_squared(pred_new_test, test.dat$price)
sse_value <- sse(pred_new_test, test.dat$price)

cat("RMSE:", rmse_value, "\n")
## RMSE: 537381.3
cat("R-squared:", r_squared_value, "\n")
## R-squared: 0.7163465
cat("SSE:", sse_value, "\n")
## SSE: 1.872441e+15
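The `rmse`, `r_squared`, and `sse` helpers used above are defined earlier in the report; under the usual definitions of these metrics they would look like:

```r
# Sum of squared errors
sse <- function(pred, obs) sum((obs - pred)^2)

# Root mean squared error
rmse <- function(pred, obs) sqrt(mean((obs - pred)^2))

# R-squared: 1 minus SSE over the total sum of squares
r_squared <- function(pred, obs) 1 - sse(pred, obs) / sum((obs - mean(obs))^2)
```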

Analysis: With an RMSE of roughly 537,381 and an MAE near 457,879 on the test set, the model’s predictions still deviate substantially from actual prices in absolute terms. The R-squared value of about 0.716 indicates that roughly 72% of the variance in house prices is explained by the network, and the SSE of about 1.87e+15 reflects the large scale of the squared errors. There remains room for refinement, for example through further feature engineering or tuning of the hidden-layer architecture, to better capture complex patterns in the data.

model_results <- compute(house_NN, test.dat.norm[-1])
predicted_price <- model_results$net.result
cor(predicted_price, test.dat$price)
##           [,1]
## [1,] 0.8463726

Analysis: The correlation of about 0.846 between the network’s predictions and the actual test-set prices indicates a strong linear association, consistent with the R-squared value reported above.

plot(test.dat.norm$price, predicted_price, 
     main = "Actual vs. Predicted Prices", 
     xlab = "Actual Prices", ylab = "Predicted Prices")
abline(0, 1, col = "red")

Analysis: The scatter plot of actual vs. predicted prices shows a cluster below the ideal 45-degree line, reinforcing that the model tends to underpredict the higher-valued houses.

VI. Model Limitation and Assumptions

Based on the performance on both the train and test data sets, this section identifies our primary (champion) model and the benchmark model against which it is compared. We validate the models on the test sample, examine whether the residuals look approximately normal and whether that matters for the chosen technique, assess prediction performance using Pseudo R^2, SSE, and RMSE, benchmark the champion against the alternatives to judge relative fit, and document any serious violations of model assumptions or limitations that a user of the model must know about.

# Consolidated results: Pseudo R-squared, SSE, and RMSE for each
# candidate model, assembled from the sections above
results.df

VII. Ongoing Model Monitoring Plan

This section describes how the model should be monitored in production, the quantitative thresholds and triggers that would signal when the model needs to be retrained or replaced, and the assumptions the model must continue to satisfy for its ongoing use.

For model monitoring, we need to:

1. Regularly check for changes in the distribution of the input and target variables, track variable importance, monitor model stability, and check for outliers in new data that might affect model performance.
2. Define the key performance metrics for regression models, e.g., Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared, and SSE, and set thresholds for these metrics based on the acceptable level of model performance.
3. Periodically validate the model on new data to ensure it generalizes well; cross-validation can be implemented to assess model performance on unseen data.
4. Be mindful of changes in the business environment, processes, customer behavior, or external and internal factors that may affect the model’s performance, and set up alerts to notify stakeholders when performance drops below acceptable levels.
5. Maintain detailed documentation of the model, including its assumptions, methodologies, and any updates made.
6. Regularly assess the model for bias that may affect certain demographic groups unfairly, and ensure that the model complies with any regulatory requirements.
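A minimal sketch of such a performance check in R, assuming a hypothetical batch of incoming records (here `new.dat.norm` with original-scale prices `new_prices`) and an illustrative RMSE threshold chosen from the validation results:

```r
# Hypothetical monitoring check: recompute RMSE on fresh data and
# compare against an agreed threshold (the threshold value is illustrative)
rmse_threshold <- 600000

monitor_model <- function(model, new.dat.norm, new_prices) {
  preds <- unnormalize(compute(model, new.dat.norm[-1])$net.result)
  new_rmse <- sqrt(mean((new_prices - preds)^2))
  if (new_rmse > rmse_threshold) {
    warning("Model RMSE ", round(new_rmse),
            " exceeds threshold; review or retrain the model.")
  }
  new_rmse
}
```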

The model must also comply with the standard assumptions required for the regression approach to hold, such as linearity, independence of errors, homoscedasticity, and normality of residuals.
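For a linear regression benchmark, these assumptions can be checked with standard residual diagnostics; a sketch, assuming a fitted `lm` object named `lm_fit` (a hypothetical name, not defined in this section):

```r
# Standard diagnostic plots: residuals vs. fitted (linearity and
# homoscedasticity), Q-Q plot (normality), scale-location, and leverage
par(mfrow = c(2, 2))
plot(lm_fit)

# A formal normality check on the residuals; shapiro.test accepts at
# most 5000 observations, so sample when the data set is larger
res <- residuals(lm_fit)
shapiro.test(sample(res, min(5000, length(res))))
```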

VIII. Conclusion

This section summarizes the project’s results and identifies the best model for the data, along with the reasoning behind that choice.

Bibliography

All references, articles, and papers consulted in this report are listed at the end of the document.

Appendix

The Appendix provides additional detailed analyses and visual representations that complement the main body of the report. This section includes further explorations of model variations, extended error assessments, and a complete overview of performance metrics, enriching the reader’s comprehension of the research findings.

Enhanced Model Visualizations

Neural Network

plot(house_NN, main="Neural Network Architecture Visualization")

# Alternative visualization of the same network via NeuralNetTools
library(NeuralNetTools)
plotnet(house_NN)

Regression Tree Depth Variations

Supplementary plots showcasing regression trees of various depths that were not included in the main analysis. These visualizations represent alternative models with simplified complexities, providing a broader understanding of the model’s behavior with less granular splitting. They serve as a reference for how the model’s structure changes with depth, offering additional context to the primary findings presented in the report.

# Plot shallower regression trees (depths 2-5) for comparison
depth_values <- c(2, 3, 4, 5)

for (depth in depth_values) {
  tree_model <- rpart(price ~ .,
                      data = train.dat,
                      method = "anova",
                      control = rpart.control(maxdepth = depth))

  # Shallower trees that were not plotted in the main model analysis
  rpart.plot(tree_model, main = paste("Regression Tree - Depth:", depth),
             type = 4, extra = 1, tweak = 1.6)
}
## Warning: labs do not fit even at cex 0.15, there may be some overplotting


Additional Predictive Accuracy Charts

Neural Network

Analysis: As in the main analysis, the scatter plot of actual vs. predicted prices shows a cluster below the ideal 45-degree line, indicating that the model tends to underpredict the higher-valued houses.

plot(test.dat.norm$price, predicted_price, 
     main = "Actual vs. Predicted Prices", 
     xlab = "Actual Prices", ylab = "Predicted Prices")
abline(0, 1, col = "red")

Neural Network Predictive Performance and Error Distribution

A dual-faceted visual evaluation of the neural network model. The first plot highlights the spread and central tendency of predictive errors, revealing the model’s precision range. The second plot maps predicted values against actual prices, offering a direct visual assessment of accuracy, with the proximity to the diagonal indicating the model’s effectiveness in capturing the underlying price determinants. Together, these plots form a comprehensive view of the model’s prediction capabilities and areas for improvement.

# Create a data frame for the Neural Network errors
nn_errors <- test.dat.norm$price - predicted_price
nn_errors_df <- data.frame(Errors = nn_errors)

# Plot Histogram of Prediction Errors
ggplot(nn_errors_df, aes(x=Errors)) +
  geom_histogram(bins = 15, fill="#6baed6", color="black") +
  labs(title="NN Prediction Error Distribution", 
       x="Prediction Error", y="Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Analysis: The histogram of prediction errors is concentrated around zero, indicating most predictions are close to the actual values; the tail towards the right (positive errors, where the actual price exceeds the prediction) shows that the model tends to underestimate some house prices, consistent with the scatter plot of actual vs. predicted values.


“Box-Cox Transformations.” n.d. https://stats.libretexts.org/Bookshelves/Introductory_Statistics/Introductory_Statistics_(Lane)/16%3A_Transformations/16.04%3A_Box-Cox_Transformations.
Faraway, Julian J. 2014. Linear Models with R. 2nd ed. Chapman & Hall/CRC Texts in Statistical Science.
Kutner, Michael H., Christopher J. Nachtsheim, John Neter, and William Li. 2005. Applied Linear Statistical Models. 5th ed. McGraw-Hill.